Abstract

We present new results on the relation between purely symbolic context-free
parsing strategies and their probabilistic counterparts. Such parsing
strategies are seen as constructions of push-down devices from grammars.
We show that preservation of probability distribution is possible under two
conditions, viz. the correct-prefix property and the property of strong
predictiveness. These results generalize existing results in the literature that were
obtained by considering parsing strategies in isolation. From our general
results we also derive negative results on so-called generalized LR parsing.


1 Introduction 


Context-free grammars and push-down automata are two equivalent formalisms 
to describe context-free languages. While a context-free grammar can be thought 
of as a purely declarative specification, a push-down automaton is considered to 
be an operational specification that determines which steps are performed for a 
given string in the process of deciding its membership of the language. By a 
parsing strategy we mean a mapping from context-free grammars to equivalent 
push-down automata, such that some specific conditions are observed. 

This paper deals with the probabilistic extensions of context-free grammars
and push-down automata, i.e., probabilistic context-free grammars and
probabilistic push-down automata [37]. These formalisms are obtained
by adding probabilities to the rules and transitions of context-free grammars
and push-down automata, respectively. More specifically, we will investigate
the problem of ‘extending’ parsing strategies to probabilistic parsing strategies.
These are mappings from probabilistic context-free grammars to probabilistic
push-down automata that preserve the induced probability distributions on the
generated/accepted languages. Two of the main results presented in this paper
can be stated as follows:


• No parsing strategy that lacks the correct-prefix property (CPP) can be 
extended to become a probabilistic parsing strategy. 

• All parsing strategies that possess the correct-prefix property and the strong
predictiveness property (SPP) can be extended to become probabilistic 
parsing strategies. 


The above results generalize previous findings reported in the literature, where only
a few specific parsing strategies were considered in isolation. Our findings also
have important implications for well-known parsing strategies such as generalized
LR parsing, henceforth simply called ‘LR parsing’.^1 LR parsing has the CPP, but
lacks the SPP, and as we will show, LR parsing cannot be extended to become a
probabilistic parsing strategy.

In the last decade, widespread interest in probabilistic parsing techniques has
arisen in the area of natural language processing. This is motivated
by the fact that natural language sentences are generally ambiguous, and natural
language software needs to be able to distinguish the more probable derivations 
of a sentence from the less probable ones. This can be achieved by letting the 
parsing process assign a probability to each parse, on the basis of a probabilistic 
grammar. In a typical application, the software may select those derivations for 
further processing that have been given the highest probabilities, and discard the 
others. The success of this approach relies on the accuracy of the probabilistic 
model expressed by the probabilistic grammar, i.e., whether the probabilities 
assigned to derivations accurately reflect the ‘true’ probabilities in the domain at 
hand. 

Probabilities are often estimated on the basis of a corpus, i.e., a collection 
of sentences. The sentences in a corpus may be annotated with various kinds 
of information. One kind of annotation that is relevant for our discussion is the 
preferred derivation for each sentence. Given a corpus with derivations, one may 
estimate probabilities of rules by their relative frequencies in the corpus. If a 
corpus is unannotated, more general techniques of maximum-likelihood
estimation can be used to estimate the probabilities of rules. (Some formal
properties of types of maximum-likelihood estimation have been studied in the
literature.)

^1 Generalized (or nondeterministic) LR parsing allows for more than one action for a given
LR state and input symbol.

The motivation for studying probabilistic models other than those obtained
by attaching probabilities to given context-free grammars is the observation that
more accurate models can be obtained by conditioning probabilities on ‘context
information’ beyond single nonterminals. Furthermore, it has been observed
that conditioning on certain types of context information can be achieved
by first translating context-free grammars to push-down automata, according to
some parsing strategy, and then attaching probabilities to the transitions thereof.
More concretely, for some parsing strategies, the set of models that can
be obtained by attaching probabilities to a push-down automaton constructed
from a context-free grammar may include models that cannot be obtained by
attaching probabilities to that grammar.

An implicit assumption of this methodology is that, conversely, any probabilistic
model that can be obtained from a grammar can also be obtained from the
associated push-down automaton, or in other words, the push-down automaton
is at least as powerful as the grammar in terms of the set of potential models. If
a parsing strategy does not satisfy this property, and if some potential models are
lost in the mapping from the grammar to the push-down automaton, then this
means that in some cases the strategy may lead to less rather than more accurate
models. That LR parsing cannot be extended to become a probabilistic parsing
strategy, as we mentioned above, means that the above property is not satisfied
by this parsing strategy. This is contrary to what is suggested by some
publications on probabilistic LR parsing, which fail to observe that
LR parsers may sometimes lead to less accurate models than the grammars from
which they were constructed.

Some studies propose lexicalized probabilistic context-free
grammars, i.e., probabilistic models based on context-free grammars in which 
probabilities heavily rely on the terminal elements from input strings. Even if 
the current paper does not specifically deal with lexicalization, much of what we 
discuss pertains to lexicalized probabilistic context-free grammars as well. 

The paper is organized as follows. After giving standard definitions in
Section 2, we give our formal definition of ‘parsing strategy’ in Section 3. We also
define what it means to extend a parsing strategy to become a probabilistic
parsing strategy. The CPP and the SPP are defined in Sections 4 and 5, where
we also discuss how these properties relate to the question of which strategies
can be extended to become probabilistic. Later sections provide examples of
parsing strategies with and without the SPP. The examples without the SPP,
most notably LR parsing, are shown not to be extendible to become probabilistic.
We then provide a wider notion of extending a strategy to become probabilistic,
and show that even under this wider notion, LR parsing cannot be
extended to become probabilistic. A further section presents an application that
concerns prefix probabilities. We end this paper with conclusions.

Some results reported here have appeared before in an abbreviated form.

2 Preliminaries

A context-free grammar (CFG) $\mathcal{G}$ is a 4-tuple $(\Sigma, N, S, R)$, where $\Sigma$ is a finite
set of terminals, called the alphabet, $N$ is a finite set of nonterminals, including
the start symbol $S$, and $R$ is a finite set of rules, each of the form $A \to \alpha$, where
$A \in N$ and $\alpha \in (\Sigma \cup N)^*$. Without loss of generality, we assume that there is
only one rule $S \to \sigma$ with the start symbol in the left-hand side, and furthermore
that $\sigma \neq \epsilon$, where $\epsilon$ denotes the empty string.

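For a fixed CFG $\mathcal{G}$, we define the relation $\Rightarrow$ on triples consisting of two strings
$\alpha, \beta \in (\Sigma \cup N)^*$ and a rule $\pi \in R$ by: $\alpha \stackrel{\pi}{\Rightarrow} \beta$ if and only if $\alpha$ is of the form $wA\delta$
and $\beta$ is of the form $w\gamma\delta$, for some $w \in \Sigma^*$ and $\delta \in (\Sigma \cup N)^*$, and $\pi = (A \to \gamma)$. A
left-most derivation is a string $d = \pi_1 \cdots \pi_m$, $m \geq 0$, such that
$S \stackrel{\pi_1}{\Rightarrow} \cdots \stackrel{\pi_m}{\Rightarrow} \alpha$, for
some $\alpha \in (\Sigma \cup N)^*$. We will identify a left-most derivation with the sequence of
strings over $\Sigma \cup N$ that arise in that derivation. In the remainder of this paper,
we will let the term ‘derivation’ refer to ‘left-most derivation’, unless specified
otherwise.

A derivation $d = \pi_1 \cdots \pi_m$, $m \geq 0$, such that
$S \stackrel{\pi_1}{\Rightarrow} \cdots \stackrel{\pi_m}{\Rightarrow} w$ where $w \in \Sigma^*$
will be called a complete derivation; we also say that $d$ is a derivation of $w$.
By subderivation we mean a substring of a complete derivation of the form
$d = \pi_1 \cdots \pi_m$, $m \geq 0$, such that
$A \stackrel{\pi_1}{\Rightarrow} \cdots \stackrel{\pi_m}{\Rightarrow} w$ for some $A$ and $w$.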

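We write $\alpha \Rightarrow^* \beta$ or $\alpha \Rightarrow^+ \beta$ to denote the existence of a string $\pi_1 \cdots \pi_m$ such
that $\alpha \stackrel{\pi_1}{\Rightarrow} \cdots \stackrel{\pi_m}{\Rightarrow} \beta$, with $m \geq 0$ or $m > 0$, respectively. We say a CFG is acyclic
if $A \Rightarrow^+ A$ does not hold for any $A \in N$.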

For a CFG $\mathcal{G}$ we define the language $L(\mathcal{G})$ it generates as the set of strings $w$
such that there is at least one derivation of $w$. We say a CFG is reduced if for
each rule $\pi \in R$ there is a complete derivation in which it occurs.

A probabilistic context-free grammar (PCFG) is a pair $(\mathcal{G}, p)$ consisting of a
CFG $\mathcal{G} = (\Sigma, N, S, R)$ and a probability function $p$ from $R$ to real numbers in
the interval $[0,1]$. We say a PCFG is proper if $\sum_{\pi = (A \to \alpha) \in R} p(\pi) = 1$ for each
$A \in N$.

For a PCFG $(\mathcal{G}, p)$, we define the probability $p(d)$ of a string $d = \pi_1 \cdots \pi_m \in R^*$
as $\prod_{i=1}^{m} p(\pi_i)$; we will in particular consider the probabilities of derivations
$d$. The probability $p(w)$ of a string $w \in \Sigma^*$ as defined by $(\mathcal{G}, p)$ is the sum of the
probabilities of all derivations of that string. We say a PCFG $(\mathcal{G}, p)$ is consistent
if $\sum_{w \in \Sigma^*} p(w) = 1$.
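
The definitions of properness and of the probability of a derivation can be illustrated with a minimal Python sketch. The representation of a rule as a pair (left-hand side, right-hand side), the function names `is_proper` and `derivation_probability`, and the toy grammar are assumptions made for illustration only; they are not part of the formal development.

```python
from collections import defaultdict
from math import isclose, prod

# Assumed representation: a rule is (lhs, rhs) with rhs a tuple of symbols,
# and p maps each rule to its probability.

def is_proper(p, tol=1e-9):
    """Check that the probabilities of the rules sharing a left-hand side sum to 1."""
    totals = defaultdict(float)
    for (lhs, _rhs), prob in p.items():
        totals[lhs] += prob
    return all(isclose(total, 1.0, abs_tol=tol) for total in totals.values())

def derivation_probability(p, d):
    """Probability of a string d of rules: the product of the rule probabilities."""
    return prod(p[rule] for rule in d)

# Hypothetical toy grammar: S -> A, A -> a A | b
p = {
    ("S", ("A",)): 1.0,
    ("A", ("a", "A")): 0.4,
    ("A", ("b",)): 0.6,
}
# Left-most derivation of the string "ab"
d = [("S", ("A",)), ("A", ("a", "A")), ("A", ("b",))]

print(is_proper(p))
print(derivation_probability(p, d))
```

Consistency is not checked here, since it involves a sum over a potentially infinite set of strings.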

In this paper we will mainly consider push-down transducers rather than 
push-down automata. Push-down transducers not only compute derivations of 
the grammar while processing an input string, but they also explicitly produce 
output strings from which these derivations can be obtained. We use transducers 
for two reasons. First, constraints on the output strings allow us to restrict 
our attention to ‘reasonable’ parsing strategies. Those strategies that cannot 
be formalized within these constraints are unlikely to be of practical interest. 
Secondly, mappings from input strings to derivations, as those realized by push-down
devices, turn out to be a very powerful abstraction and allow direct proofs
of several general results.

Differently from many textbooks, our push-down devices do not possess states 
next to stack symbols. This is without loss of generality, since states can be 
encoded into the stack symbols, given the types of transition that we allow. 
Thus, a push-down transducer (PDT) $\mathcal{A}$ is a 6-tuple $(\Sigma_1, \Sigma_2, Q, X_{init}, X_{final},
\Delta)$, where $\Sigma_1$ is the input alphabet, $\Sigma_2$ is the output alphabet, $Q$ is a finite
set of stack symbols including the initial stack symbol $X_{init}$ and the final stack
symbol $X_{final}$, and $\Delta$ is the set of transitions. Each transition can have one of the
following three forms: $X \mapsto XY$ (a push transition), $YX \mapsto Z$ (a pop transition),
or $X \stackrel{x,y}{\mapsto} Y$ (a swap transition); here $X, Y, Z \in Q$, $x \in \Sigma_1 \cup \{\epsilon\}$ and $y \in \Sigma_2^*$.
Note that in our notation, stacks grow from left to right, i.e., the top-most stack 
symbol will be found at the right end. 

Without loss of generality, we assume that any PDT is such that for a given
stack symbol $X \neq X_{final}$, there are either one or more push transitions $X \mapsto XY$,
or one or more pop transitions $YX \mapsto Z$, or one or more swap transitions $X \stackrel{x,y}{\mapsto} Y$,
but no combinations of different types of transition. If a PDT does not satisfy
this normal form, it can easily be brought into this form by introducing for each
stack symbol $X$ three new stack symbols $X_{push}$, $X_{pop}$ and $X_{swap}$ and new swap
transitions $X \stackrel{\epsilon,\epsilon}{\mapsto} X_{push}$, $X \stackrel{\epsilon,\epsilon}{\mapsto} X_{pop}$ and $X \stackrel{\epsilon,\epsilon}{\mapsto} X_{swap}$. In each existing transition
that operates on top-of-stack $X$, we then replace $X$ by one from $X_{push}$, $X_{pop}$ or
$X_{swap}$, depending on the type of that transition. We also assume that $X_{final}$ does
not occur in the left-hand side of a transition, again without loss of generality.

A configuration of a PDT is a triple $(\alpha, w, v)$, where $\alpha \in Q^*$ is a stack, $w \in \Sigma_1^*$
is the remaining input, and $v \in \Sigma_2^*$ is the output generated so far. For a fixed
PDT $\mathcal{A}$, we define the relation $\vdash$ on triples consisting of two configurations and
a transition $\tau$ by: $(\gamma\alpha, xw, v) \stackrel{\tau}{\vdash} (\gamma\beta, w, vy)$ if and only if $\tau$ is of the form $\alpha \mapsto \beta$,
where $x = y = \epsilon$, or of the form $\alpha \stackrel{x,y}{\mapsto} \beta$. A computation on an input string $w$ is a
string $c = \tau_1 \cdots \tau_m$, $m \geq 0$, such that $(X_{init}, w, \epsilon) \stackrel{\tau_1}{\vdash} \cdots \stackrel{\tau_m}{\vdash} (\alpha, w', v)$. A complete
computation on a string $w$ is a computation with $w' = \epsilon$ and $\alpha = X_{final}$. The
string $v$ is called the output of the computation $c$, and is denoted by $out(c)$.

We will identify a computation with the sequence of configurations that arise
in that computation, where the first configuration is determined by the context.
We also write $(\alpha, w, v) \vdash^* (\beta, w', v')$ or $(\alpha, w, v) \stackrel{c}{\vdash^*} (\beta, w', v')$, for $\alpha, \beta \in Q^*$,
$w, w' \in \Sigma_1^*$ and $v, v' \in \Sigma_2^*$, to indicate that $(\beta, w', v')$ can be obtained from
$(\alpha, w, v)$ by applying a sequence $c$ of zero or more transitions; we refer to such a
sequence $c$ as a subcomputation. The function $out$ is extended to subcomputations
in a natural way.

For a PDT $\mathcal{A}$, we define the language $L(\mathcal{A})$ it accepts as the set of strings
$w$ such that there is at least one complete computation on $w$. We say a PDT is
reduced if each transition $\tau \in \Delta$ occurs in some complete computation.
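
To make the notions of configuration and complete computation concrete, the following Python sketch enumerates the complete computations of a small PDT on a given input. The tuple encoding of transitions, the function name `complete_computations`, the bound `max_steps` (needed because push transitions may otherwise loop forever) and the toy PDT are all assumptions for illustration.

```python
# Assumed encoding of transitions (stacks grow from left to right):
#   ("push", X, Y)        X |-> X Y
#   ("pop", Y, X, Z)      Y X |-> Z, with X the top-of-stack
#   ("swap", X, x, y, Y)  X |-> Y reading x ("" for epsilon), emitting the tuple y

def complete_computations(transitions, x_init, x_final, w, max_steps=50):
    """Return (computation, output) pairs for all complete computations on w
    that use at most max_steps transitions."""
    results = []

    def step(stack, i, out, comp):
        if stack == [x_final] and i == len(w):
            results.append((tuple(comp), tuple(out)))
            return
        if len(comp) >= max_steps:
            return
        top = stack[-1]
        for t in transitions:
            if t[0] == "push" and t[1] == top:
                step(stack + [t[2]], i, out, comp + [t])
            elif t[0] == "pop" and len(stack) >= 2 and t[2] == top and t[1] == stack[-2]:
                step(stack[:-2] + [t[3]], i, out, comp + [t])
            elif t[0] == "swap" and t[1] == top:
                x, y, target = t[2], t[3], t[4]
                if x == "":
                    step(stack[:-1] + [target], i, out + list(y), comp + [t])
                elif w[i:i + 1] == x:
                    step(stack[:-1] + [target], i + 1, out + list(y), comp + [t])

    step([x_init], 0, [], [])
    return results

# Hypothetical toy PDT accepting only the string "a" and emitting the symbol "r".
transitions = [
    ("push", "I", "A"),
    ("swap", "A", "a", ("r",), "B"),
    ("pop", "I", "B", "F"),
]

for comp, out in complete_computations(transitions, "I", "F", "a"):
    print(len(comp), out)
```

On input "a" the stack develops as I, then I A, then I B, then F, so the single complete computation consists of three transitions.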

A probabilistic push-down transducer (PPDT) is a pair $(\mathcal{A}, p)$ consisting of a
PDT $\mathcal{A}$ and a probability function $p$ from the set $\Delta$ of transitions of $\mathcal{A}$ to real
numbers in the interval $[0,1]$. We say a PPDT $(\mathcal{A}, p)$ is proper if

• $\sum_{\tau = (X \mapsto XY) \in \Delta} p(\tau) = 1$ for each $X \in Q$ such that there is at least one
transition $X \mapsto XY$, $Y \in Q$;

• $\sum_{\tau = (X \stackrel{x,y}{\mapsto} Y) \in \Delta} p(\tau) = 1$ for each $X \in Q$ such that there is at least one
transition $X \stackrel{x,y}{\mapsto} Y$, $x \in \Sigma_1 \cup \{\epsilon\}$, $y \in \Sigma_2^*$, $Y \in Q$; and

• $\sum_{\tau = (YX \mapsto Z) \in \Delta} p(\tau) = 1$ for each $X, Y \in Q$ such that there is at least one
transition $YX \mapsto Z$, $Z \in Q$.

For a PPDT $(\mathcal{A}, p)$, we define the probability $p(c)$ of a (sub)computation
$c = \tau_1 \cdots \tau_m$ as $\prod_{i=1}^{m} p(\tau_i)$. The probability $p(w)$ of a string $w$ as defined by
$(\mathcal{A}, p)$ is the sum of the probabilities of all complete computations on that string.
We say a PPDT $(\mathcal{A}, p)$ is consistent if $\sum_{w \in \Sigma_1^*} p(w) = 1$.

We say a PCFG $(\mathcal{G}, p)$ is reduced if $\mathcal{G}$ is reduced, and we say a PPDT $(\mathcal{A}, p)$
is reduced if $\mathcal{A}$ is reduced.
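
The three properness conditions for PPDTs differ only in how transitions are grouped: push and swap transitions are grouped by the top-of-stack symbol, pop transitions by the pair of topmost stack symbols. A minimal Python sketch follows; the tuple encoding of transitions and the name `ppdt_is_proper` are hypothetical choices for illustration.

```python
from collections import defaultdict
from math import isclose

# Assumed encoding of transitions:
#   ("push", X, Y)        X |-> X Y
#   ("swap", X, x, y, Y)  X |-> Y with input x and output y
#   ("pop", Y, X, Z)      Y X |-> Z, with X the top-of-stack

def ppdt_is_proper(p, tol=1e-9):
    """Check the three properness conditions for a transition probability map p."""
    totals = defaultdict(float)
    for t, prob in p.items():
        if t[0] == "pop":
            key = ("pop", t[1], t[2])   # grouped by the pair (Y, X)
        else:
            key = (t[0], t[1])          # grouped by the top-of-stack X
        totals[key] += prob
    return all(isclose(total, 1.0, abs_tol=tol) for total in totals.values())

# Hypothetical toy PPDT
p = {
    ("push", "I", "A"): 1.0,
    ("swap", "A", "a", ("r1",), "B"): 0.3,
    ("swap", "A", "b", ("r2",), "B"): 0.7,
    ("pop", "I", "B", "F"): 1.0,
}
print(ppdt_is_proper(p))
```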

3 Parsing strategies 

The term ‘parsing strategy’ is often used informally to refer to a class of parsing 
algorithms that behave similarly in some way. In this paper, we assign a formal 
meaning to this term, relying on the observation, made in the literature, that many parsing
algorithms for CFGs can be described in two steps. The first is a construction
of push-down devices from CFGs, and the second is a method for handling
nondeterminism (e.g. backtracking or dynamic programming). Parsing algorithms
that handle nondeterminism in different ways but apply the same construction
of push-down devices from CFGs are seen as realizations of the same parsing
strategy. 

Thus, we define a parsing strategy to be a function $\mathcal{S}$ that maps a reduced
CFG $\mathcal{G} = (\Sigma_1, N, S, R)$ to a pair $\mathcal{S}(\mathcal{G}) = (\mathcal{A}, f)$ consisting of a reduced PDT
$\mathcal{A} = (\Sigma_1, \Sigma_2, Q, X_{init}, X_{final}, \Delta)$, and a function $f$ that maps a subset of $\Sigma_2^*$ to
a subset of $R^*$, with the following properties:

• $R \subseteq \Sigma_2$.

• For each string $w \in \Sigma_1^*$ and each complete computation $c$ on $w$, $f(out(c)) = d$
is a derivation of $w$. Furthermore, each symbol from $R$ occurs as often in
$out(c)$ as it occurs in $d$.

• Conversely, for each string $w \in \Sigma_1^*$ and each derivation $d$ of $w$, there is
precisely one complete computation $c$ on $w$ such that $f(out(c)) = d$.

If $c$ is a complete computation, we will write $f(c)$ to denote $f(out(c))$. The
conditions above then imply that $f$ is a bijection from complete computations to
complete derivations.

Note that output strings of (complete) computations may contain symbols
that are not in $R$, and the symbols that are in $R$ may occur in a different order in
$v$ than in $f(v) = d$. The purpose of the symbols in $\Sigma_2 - R$ is to help this process
of reordering of symbols in $R$. For a string $v \in \Sigma_2^*$ we let $\bar{v}$ refer to the maximal
subsequence of symbols from $v$ that belong to $R$, or in other words, string $\bar{v}$ is
obtained by erasing from $v$ all occurrences of symbols from $\Sigma_2 - R$.

A probabilistic parsing strategy is defined to be a function $\mathcal{S}$ that maps a
reduced, proper and consistent PCFG $(\mathcal{G}, p_{\mathcal{G}})$ to a triple $\mathcal{S}(\mathcal{G}, p_{\mathcal{G}}) = (\mathcal{A}, p_{\mathcal{A}}, f)$,
where $(\mathcal{A}, p_{\mathcal{A}})$ is a reduced, proper and consistent PPDT, with the same properties
as a (non-probabilistic) parsing strategy, and in addition:

• For each complete derivation $d$ and each complete computation $c$ such that
$f(c) = d$, $p_{\mathcal{G}}(d)$ equals $p_{\mathcal{A}}(c)$.

In other words, a complete computation has the same probability as the complete
derivation that it is mapped to by function $f$. An implication of this property is
that for each string $w \in \Sigma_1^*$, the probabilities assigned to that string by $(\mathcal{G}, p_{\mathcal{G}})$
and $(\mathcal{A}, p_{\mathcal{A}})$ are equal.

We say that probabilistic parsing strategy $\mathcal{S}'$ is an extension of parsing
strategy $\mathcal{S}$ if for each reduced CFG $\mathcal{G}$ and probability function $p_{\mathcal{G}}$ we have
$\mathcal{S}(\mathcal{G}) = (\mathcal{A}, f)$ if and only if $\mathcal{S}'(\mathcal{G}, p_{\mathcal{G}}) = (\mathcal{A}, p_{\mathcal{A}}, f)$ for some $p_{\mathcal{A}}$.

In the following sections we will investigate which parsing strategies can be 
extended to become probabilistic parsing strategies. 

4 Correct-prefix property 

For a given PDT, we say a computation $c$ is dead if $(X_{init}, w_1, \epsilon) \stackrel{c}{\vdash^*} (\alpha, \epsilon, v_1)$, for
some $\alpha \in Q^*$, $w_1 \in \Sigma_1^*$ and $v_1 \in \Sigma_2^*$, and there are no $w_2 \in \Sigma_1^*$ and $v_2 \in \Sigma_2^*$ such
that $(\alpha, w_2, \epsilon) \vdash^* (X_{final}, \epsilon, v_2)$. Informally, a dead computation is a computation
that cannot be continued to become a complete computation.

We say that a PDT has the correct-prefix property (CPP) if it does not allow
any dead computations. We say that a parsing strategy has the CPP if it maps
each reduced CFG to a PDT that has the CPP.

In this section we show that the correct-prefix property is a necessary
condition for extending a parsing strategy to a probabilistic parsing strategy. For this
we need two lemmas.

Lemma 1 For each reduced CFG $\mathcal{G}$, there is a probability function $p_{\mathcal{G}}$ such that
PCFG $(\mathcal{G}, p_{\mathcal{G}})$ is proper and consistent, and $p_{\mathcal{G}}(d) > 0$ for all complete
derivations $d$.

Proof. Since $\mathcal{G}$ is reduced, there is a finite set $L$ consisting of complete
derivations $d$, such that for each rule $\pi$ in $\mathcal{G}$ there is at least one $d \in L$ in which $\pi$
occurs. Let $n_{\pi,d}$ be the number of occurrences of rule $\pi$ in derivation $d \in L$, and
let $n_\pi$ be $\sum_{d \in L} n_{\pi,d}$, the total number of occurrences of $\pi$ in $L$. Let $n_A$ be the
sum of $n_\pi$ for all rules $\pi$ with $A$ in the left-hand side. A probability function $p_{\mathcal{G}}$
can be defined through ‘maximum-likelihood estimation’ such that $p_{\mathcal{G}}(\pi) = n_\pi / n_A$
for each rule $\pi = A \to \alpha$.

For all nonterminals $A$,

\[ \sum_{\pi = A \to \alpha} p_{\mathcal{G}}(\pi) = \sum_{\pi = A \to \alpha} \frac{n_\pi}{n_A} = \frac{n_A}{n_A} = 1, \]

which means that the PCFG $(\mathcal{G}, p_{\mathcal{G}})$ is proper. Furthermore, it has been shown that a PCFG
$(\mathcal{G}, p_{\mathcal{G}})$ is consistent if $p_{\mathcal{G}}$ was obtained by maximum-likelihood estimation using
a set of derivations. Finally, since $n_\pi > 0$ for each $\pi$, also $p_{\mathcal{G}}(\pi) > 0$ for each $\pi$,
and $p_{\mathcal{G}}(d) > 0$ for all complete derivations $d$. ■
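
The estimation step used in this proof, i.e., setting the probability of a rule to its count divided by the total count of rules with the same left-hand side, can be sketched in a few lines of Python. The rule encoding, the function name `estimate_by_relative_frequency` and the two sample derivations are assumptions for illustration; exact fractions are used so that properness can be checked without rounding.

```python
from collections import Counter
from fractions import Fraction

def estimate_by_relative_frequency(derivations):
    """Map each rule pi = (lhs, rhs) occurring in the derivations to n_pi / n_lhs."""
    n_rule = Counter(rule for d in derivations for rule in d)
    n_lhs = Counter()
    for (lhs, _rhs), n in n_rule.items():
        n_lhs[lhs] += n
    return {rule: Fraction(n, n_lhs[rule[0]]) for rule, n in n_rule.items()}

# Hypothetical rules of a toy grammar: S -> A, A -> a A | b
r1 = ("S", ("A",))
r2 = ("A", ("a", "A"))
r3 = ("A", ("b",))

# A finite set of complete derivations in which every rule occurs at least once
derivations = [[r1, r3], [r1, r2, r3]]

p = estimate_by_relative_frequency(derivations)
for rule, prob in p.items():
    print(rule, prob)
```

Every rule that occurs in the sample receives a strictly positive probability, which is the property the lemma needs.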

We say a computation is a shortest dead computation if it is dead and none of 
its proper prefixes is dead. Note that each dead computation has a unique prefix 
that is a shortest dead computation. For a PDT $\mathcal{A}$, let $T_{\mathcal{A}}$ be the union of the
set of all complete computations and the set of all shortest dead computations.


Lemma 2 For each proper PPDT $(\mathcal{A}, p_{\mathcal{A}})$, $\sum_{c \in T_{\mathcal{A}}} p_{\mathcal{A}}(c) \leq 1$.

Proof. The proof is a trivial variant of the proof that for a proper PCFG $(\mathcal{G}, p_{\mathcal{G}})$,
the sum of $p_{\mathcal{G}}(d)$ for all derivations $d$ cannot exceed 1, which is shown in the literature. ■

From this, the main result of this section follows. 


Theorem 3 A parsing strategy that lacks the CPP cannot be extended to become
a probabilistic parsing strategy.


Proof. Take a parsing strategy $\mathcal{S}$ that does not have the CPP. Then there is a
reduced CFG $\mathcal{G} = (\Sigma_1, N, S, R)$, with $\mathcal{S}(\mathcal{G}) = (\mathcal{A}, f)$ for some $\mathcal{A}$ and $f$, and a
shortest dead computation $c$ allowed by $\mathcal{A}$.

It follows from Lemma 1 that there is a probability function $p_{\mathcal{G}}$ such that
$(\mathcal{G}, p_{\mathcal{G}})$ is a proper and consistent PCFG and $p_{\mathcal{G}}(d) > 0$ for all complete
derivations $d$. Assume we also have a probability function $p_{\mathcal{A}}$ such that $(\mathcal{A}, p_{\mathcal{A}})$ is a
proper and consistent PPDT that assigns the same probabilities to strings over
$\Sigma_1$ as $(\mathcal{G}, p_{\mathcal{G}})$. Since $\mathcal{A}$ is reduced, each transition $\tau$ must occur in some
complete computation $c'$. Furthermore, for each complete computation $c'$ there is a
complete derivation $d$ such that $f(c') = d$, and $p_{\mathcal{A}}(c') = p_{\mathcal{G}}(d) > 0$. Therefore,
$p_{\mathcal{A}}(\tau) > 0$ for each transition $\tau$, and $p_{\mathcal{A}}(c) > 0$, where $c$ is the above-mentioned
dead computation.

Due to Lemma 2,

\[ 1 \geq \sum_{c' \in T_{\mathcal{A}}} p_{\mathcal{A}}(c') \geq \sum_{w \in \Sigma_1^*} p_{\mathcal{A}}(w) + p_{\mathcal{A}}(c) > \sum_{w \in \Sigma_1^*} p_{\mathcal{A}}(w) = \sum_{w \in \Sigma_1^*} p_{\mathcal{G}}(w). \]

This is in contradiction with the consistency of
$(\mathcal{G}, p_{\mathcal{G}})$. Hence, a probability function $p_{\mathcal{A}}$ with the properties we required above
cannot exist, and therefore $\mathcal{S}$ cannot be extended to become a probabilistic
parsing strategy. ■


5 Strong predictiveness 

For a fixed PDT, we define the binary relation $\leadsto$ on stack symbols by: $Y \leadsto Y'$
if and only if $(Y, w, \epsilon) \vdash^* (Y', \epsilon, v)$ for some $w \in \Sigma_1^*$ and $v \in \Sigma_2^*$. In other words,
some subcomputation may start with stack $Y$ and end with stack $Y'$. Note that
all stacks that occur in such a subcomputation must have height of 1 or more.

We say that a PDT has the strong predictiveness property (SPP) if the
existence of three transitions $X \mapsto XY$, $XY_1 \mapsto Z_1$ and $XY_2 \mapsto Z_2$ such that $Y \leadsto Y_1$
and $Y \leadsto Y_2$ implies $Z_1 = Z_2$. Informally, this means that when a
subcomputation starts with some stack $\alpha$ and some push transition $\tau$, then solely on the
basis of $\tau$ we can uniquely determine what stack symbol $Z_1 = Z_2$ will be on top
of the stack in the first configuration with stack height equal to $|\alpha|$. Another
way of looking at it is that no information may flow from higher stack elements
to lower stack elements that was not already predicted before these higher stack
elements came into being, hence the term ‘strong predictiveness’.^2

We say that a parsing strategy has the SPP if it maps each reduced CFG to 
a PDT with the SPP. 
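
For a given finite PDT, both the relation $\leadsto$ and the SPP can be decided by a simple fixpoint computation, since input and output symbols play no role in either definition. The following Python sketch is one possible way to do this, under an assumed encoding of transitions as tuples; the names `reach_relation` and `has_spp` and the two toy devices are hypothetical.

```python
# Assumed encoding (input and output of swap transitions are ignored):
#   pushes: set of (X, Y)      for X |-> X Y
#   swaps:  set of (X, Y)      for X |-> Y
#   pops:   set of (X, Y1, Z)  for X Y1 |-> Z, with Y1 the top-of-stack

def reach_relation(symbols, pushes, pops, swaps):
    """Least reflexive relation closed under swaps and matched push/pop pairs."""
    reach = {(x, x) for x in symbols}
    changed = True
    while changed:
        new = set()
        for (a, x) in reach:
            for (x1, y) in swaps:
                if x1 == x:
                    new.add((a, y))
            for (x1, y) in pushes:
                if x1 != x:
                    continue
                for (below, top, z) in pops:
                    if below == x and (y, top) in reach:
                        new.add((a, z))
        changed = not new.issubset(reach)
        reach |= new
    return reach

def has_spp(symbols, pushes, pops, swaps):
    """Each push X |-> X Y must determine a unique Z over all reachable pops."""
    reach = reach_relation(symbols, pushes, pops, swaps)
    for (x, y) in pushes:
        targets = {z for (below, top, z) in pops if below == x and (y, top) in reach}
        if len(targets) > 1:
            return False
    return True

# Hypothetical device with the SPP
print(has_spp({"I", "A", "B", "F"}, {("I", "A")}, {("I", "B", "F")}, {("A", "B")}))

# Hypothetical device without the SPP: the symbol on top after the pop
# depends on which swap transition was taken above I.
print(has_spp({"I", "A", "B", "C", "F", "G"}, {("I", "A")},
              {("I", "B", "F"), ("I", "C", "G")}, {("A", "B"), ("A", "C")}))
```

The closure rule for push transitions reflects the decomposition of a subcomputation into a push, an inner subcomputation that never goes below the pushed symbol, and the matching pop.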

In the previous section it was shown that we may restrict ourselves to parsing 
strategies that have the CPP. Here we show that if, in addition, a parsing strategy 
has the SPP, then it can always be extended to become a probabilistic parsing 
strategy. 

Theorem 4 Any parsing strategy that has the CPP and the SPP can be extended 
to become a probabilistic parsing strategy. 


Proof. Take a parsing strategy $\mathcal{S}$ that has the CPP and the SPP, and take a
reduced PCFG $(\mathcal{G}, p_{\mathcal{G}})$, where $\mathcal{G} = (\Sigma_1, N, S, R)$, and let $\mathcal{S}(\mathcal{G}) = (\mathcal{A}, f)$, for some
PDT $\mathcal{A}$ and function $f$. We will show that there is a probability function $p_{\mathcal{A}}$ such
that $(\mathcal{A}, p_{\mathcal{A}})$ is a PPDT and $p_{\mathcal{A}}(c) = p_{\mathcal{G}}(f(c))$ for all complete computations $c$.

For each stack symbol $X$, consider the set of transitions that are applicable
with top-of-stack $X$. Remember that our normal form ensures that all such
transitions are of the same type. Suppose this set consists of $m$ swap transitions
$\tau_i = X \stackrel{x_i, y_i}{\mapsto} Y_i$, $1 \leq i \leq m$. For each $i$, consider all subcomputations of the form
$(X, x_i w, \epsilon) \stackrel{\tau_i}{\vdash} (Y_i, w, y_i) \vdash^* (Y', \epsilon, v)$ such that there is at least one pop transition
of the form $ZY' \mapsto Z'$ or such that $Y' = X_{final}$, and define $L_{\tau_i}$ as the set of strings
$v$ output by these subcomputations. We also define $L_X = \bigcup_{j=1}^{m} L_{\tau_j}$, the set of
all strings output by subcomputations starting with top-of-stack $X$, and ending
just before a pop transition that leads to a stack with height smaller than that
of the stack at the beginning, or ending with the final stack symbol $X_{final}$.
Now define for each $i$ ($1 \leq i \leq m$):

\[ p_{\mathcal{A}}(\tau_i) = \frac{\sum_{v \in L_{\tau_i}} p_{\mathcal{G}}(\bar{v})}{\sum_{v \in L_X} p_{\mathcal{G}}(\bar{v})} \qquad (1) \]

In other words, the probability of a transition is the normalized probability of the
set of subcomputations starting with that transition, relating subcomputations
with fragments of derivations of the PCFG.

^2 There is a property of push-down devices called faiblement predictif (weakly predictive).
Contrary to what this name may suggest, however, this property is incomparable with the
complement of our notion of SPP.

These probabilities are well-defined. Since $\mathcal{A}$ is reduced and has the CPP, the
sets $L_{\tau_i}$ are non-empty and thereby the denominator in the definition of $p_{\mathcal{A}}(\tau_i)$
is non-zero. Furthermore, $\sum_{i=1}^{m} p_{\mathcal{A}}(\tau_i)$ is clearly 1.

Now suppose the set of transitions for $X$ consists of $m$ push transitions $\tau_i =
X \mapsto XY_i$, $1 \leq i \leq m$. For each $i$, consider all subcomputations of the form
$(X, w, \epsilon) \stackrel{\tau_i}{\vdash} (XY_i, w, \epsilon) \vdash^* (X', \epsilon, v)$ such that there is at least one pop transition
of the form $ZX' \mapsto Z'$ or $X' = X_{final}$, and define $L_{\tau_i}$, $L_X$ and $p_{\mathcal{A}}(\tau_i)$ as we have
done above for the swap transitions.

Suppose the set of transitions for $X$ consists of $m$ pop transitions $\tau_i = Y_i X \mapsto
Z_i$, $1 \leq i \leq m$. Define $L_X = \{\epsilon\}$, and $p_{\mathcal{A}}(\tau_i) = 1$ for each $i$. To see that this is
compatible with the condition of properness of PPDTs, note the following. Since
we may assume $\mathcal{A}$ is reduced, if $Y_i = Y_j$ for some $i$ and $j$ with $1 \leq i, j \leq m$, then
there is at least one transition $Y_i \mapsto Y_i X'$ for some $X'$ such that $X' \leadsto X$. Due
to the SPP, $Z_i = Z_j$ and therefore $i = j$.

Finally, we define $L_{X_{final}} = \{\epsilon\}$.

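Take a subcomputation $(X, w, \epsilon) \stackrel{c}{\vdash^*} (Y, \epsilon, v)$ such that there is at least one
pop transition of the form $ZY \mapsto Y'$ or $Y = X_{final}$. Below we will prove that:

\[ p_{\mathcal{A}}(c) = \frac{p_{\mathcal{G}}(\bar{v})}{\sum_{v' \in L_X} p_{\mathcal{G}}(\bar{v}')} \qquad (2) \]

Since a complete computation $c$ with output $v$ is of this form, with $X = X_{init}$
and $Y = X_{final}$, we obtain the result we required to prove Theorem 4, where $D$
denotes the set of all complete derivations of CFG $\mathcal{G}$:

\[
\begin{aligned}
p_{\mathcal{A}}(c) &= \frac{p_{\mathcal{G}}(\bar{v})}{\sum_{v' \in L_{X_{init}}} p_{\mathcal{G}}(\bar{v}')} && (3) \\
&= \frac{p_{\mathcal{G}}(f(c))}{\sum_{v' \in L_{X_{init}}} p_{\mathcal{G}}(f(v'))} && (4) \\
&= \frac{p_{\mathcal{G}}(f(c))}{\sum_{d \in D} p_{\mathcal{G}}(d)} && (5) \\
&= p_{\mathcal{G}}(f(c)) && (6)
\end{aligned}
\]

We have used two properties of $f$ here. The first is that it preserves the frequencies
of symbols from $R$, if considered as a mapping from output strings to derivations.
The second property is that it can be considered as a bijection from complete
computations to derivations. Lastly we have used consistency of PCFG $(\mathcal{G}, p_{\mathcal{G}})$,
meaning that $\sum_{d \in D} p_{\mathcal{G}}(d) = 1$.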

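For the proof of (2) we proceed by induction on the length of $c$ and distinguish
three cases.

Case 1: Consider a subcomputation $c$ consisting of zero transitions, which
naturally has output $v = \epsilon$, with only configuration $(X, \epsilon, \epsilon)$, where there is at
least one pop transition of the form $ZX \mapsto Z'$ or $X = X_{final}$. We trivially have
$p_{\mathcal{A}}(c) = 1$ and

\[ \frac{p_{\mathcal{G}}(\bar{v})}{\sum_{v' \in L_X} p_{\mathcal{G}}(\bar{v}')} = \frac{p_{\mathcal{G}}(\epsilon)}{p_{\mathcal{G}}(\epsilon)} = 1. \]

Case 2: Consider a subcomputation $c = \tau_i c'$, where $(X, x_i w, \epsilon) \stackrel{\tau_i}{\vdash} (Y_i, w, y_i)
\stackrel{c'}{\vdash^*} (Y', \epsilon, y_i v)$, such that there is at least one pop transition of the form $ZY' \mapsto Z'$
or $Y' = X_{final}$. The induction hypothesis states that:

\[ p_{\mathcal{A}}(c') = \frac{p_{\mathcal{G}}(\bar{v})}{\sum_{v' \in L_{Y_i}} p_{\mathcal{G}}(\bar{v}')} \qquad (7) \]

If we combine this with the definition of $p_{\mathcal{A}}$, we obtain:

\[
\begin{aligned}
p_{\mathcal{A}}(c) &= p_{\mathcal{A}}(\tau_i) \cdot p_{\mathcal{A}}(c') && (8) \\
&= \frac{\sum_{v' \in L_{\tau_i}} p_{\mathcal{G}}(\bar{v}')}{\sum_{v' \in L_X} p_{\mathcal{G}}(\bar{v}')} \cdot \frac{p_{\mathcal{G}}(\bar{v})}{\sum_{v' \in L_{Y_i}} p_{\mathcal{G}}(\bar{v}')} && (9) \\
&= \frac{p_{\mathcal{G}}(\bar{y}_i) \cdot \sum_{v' \in L_{Y_i}} p_{\mathcal{G}}(\bar{v}')}{\sum_{v' \in L_X} p_{\mathcal{G}}(\bar{v}')} \cdot \frac{p_{\mathcal{G}}(\bar{v})}{\sum_{v' \in L_{Y_i}} p_{\mathcal{G}}(\bar{v}')} && (10) \\
&= \frac{p_{\mathcal{G}}(\bar{y}_i) \cdot p_{\mathcal{G}}(\bar{v})}{\sum_{v' \in L_X} p_{\mathcal{G}}(\bar{v}')} && (11) \\
&= \frac{p_{\mathcal{G}}(\overline{y_i v})}{\sum_{v' \in L_X} p_{\mathcal{G}}(\bar{v}')} && (12)
\end{aligned}
\]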


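Case 3: Consider a subcomputation $c$ of the form $(X, w, \epsilon) \stackrel{\tau_i}{\vdash} (XY_i, w, \epsilon)
\vdash^* (X'', \epsilon, v)$ such that there is at least one pop transition of the form $ZX'' \mapsto
Z'$ or $X'' = X_{final}$. Subcomputation $c$ can be decomposed in a unique way as
$c = \tau_i c' \tau c''$, consisting of an application of a push transition $\tau_i = X \mapsto XY_i$,
a subcomputation $(Y_i, w_1, \epsilon) \stackrel{c'}{\vdash^*} (Y', \epsilon, v_1)$, an application of a pop transition
$\tau = XY' \mapsto X'_i$, and a subcomputation $(X'_i, w_2, \epsilon) \stackrel{c''}{\vdash^*} (X'', \epsilon, v_2)$, where $w = w_1 w_2$
and $v = v_1 v_2$. This is visualized in Figure 1.

We can now use the induction hypothesis twice, resulting in:

\[ p_{\mathcal{A}}(c') = \frac{p_{\mathcal{G}}(\bar{v}_1)}{\sum_{v'_1 \in L_{Y_i}} p_{\mathcal{G}}(\bar{v}'_1)} \qquad (13) \]

and

\[ p_{\mathcal{A}}(c'') = \frac{p_{\mathcal{G}}(\bar{v}_2)}{\sum_{v'_2 \in L_{X'_i}} p_{\mathcal{G}}(\bar{v}'_2)} \qquad (14) \]

If we combine this with the definition of $p_{\mathcal{A}}$, we obtain:

\[
\begin{aligned}
p_{\mathcal{A}}(c) &= p_{\mathcal{A}}(\tau_i) \cdot p_{\mathcal{A}}(c') \cdot p_{\mathcal{A}}(\tau) \cdot p_{\mathcal{A}}(c'') && (15) \\
&= \frac{\sum_{v' \in L_{\tau_i}} p_{\mathcal{G}}(\bar{v}')}{\sum_{v' \in L_X} p_{\mathcal{G}}(\bar{v}')} \cdot \frac{p_{\mathcal{G}}(\bar{v}_1)}{\sum_{v'_1 \in L_{Y_i}} p_{\mathcal{G}}(\bar{v}'_1)} \cdot 1 \cdot \frac{p_{\mathcal{G}}(\bar{v}_2)}{\sum_{v'_2 \in L_{X'_i}} p_{\mathcal{G}}(\bar{v}'_2)} && (16)
\end{aligned}
\]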















Figure 1: Development of the stack in the computation $c = \tau_i c' \tau c''$.


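Since $\mathcal{A}$ has the SPP, $X'_i$ is unique to $\tau_i$ and the output strings in $L_{\tau_i}$ are
precisely those that can be obtained by concatenating an output string in $L_{Y_i}$
and an output string in $L_{X'_i}$. Therefore

\[ \sum_{v' \in L_{\tau_i}} p_{\mathcal{G}}(\bar{v}') = \sum_{v'_1 \in L_{Y_i},\, v'_2 \in L_{X'_i}} p_{\mathcal{G}}(\overline{v'_1 v'_2}) = \sum_{v'_1 \in L_{Y_i}} p_{\mathcal{G}}(\bar{v}'_1) \cdot \sum_{v'_2 \in L_{X'_i}} p_{\mathcal{G}}(\bar{v}'_2), \]

and

\[
\begin{aligned}
p_{\mathcal{A}}(c) &= \frac{p_{\mathcal{G}}(\bar{v}_1) \cdot p_{\mathcal{G}}(\bar{v}_2)}{\sum_{v' \in L_X} p_{\mathcal{G}}(\bar{v}')} && (17) \\
&= \frac{p_{\mathcal{G}}(\overline{v_1 v_2})}{\sum_{v' \in L_X} p_{\mathcal{G}}(\bar{v}')} && (18) \\
&= \frac{p_{\mathcal{G}}(\bar{v})}{\sum_{v' \in L_X} p_{\mathcal{G}}(\bar{v}')} && (19)
\end{aligned}
\]

This concludes the proof. ■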

Note that the definition of $p_{\mathcal{A}}$ in the above proof relies on the strings output
by $\mathcal{A}$. This is the main reason why we needed to consider push-down transducers
rather than push-down automata (defined below). Now assume an appropriate
probability function $p_{\mathcal{A}}$ has been found such that $(\mathcal{A}, p_{\mathcal{A}})$ is a PPDT that assigns
the same probabilities to computations as the given PCFG assigns to the
corresponding derivations, following the construction from the proof above. Then
the probabilities assigned to strings over the input alphabet are also equal. We
may subsequently ignore the output strings if the application at hand merely
requires probabilistic recognition rather than probabilistic transduction, or in other
words, we may simplify push-down transducers to push-down automata.

Formally, a push-down automaton (PDA) $\mathcal{A}$ is a 5-tuple $(\Sigma, Q, X_{init}, X_{final},
\Delta)$, where $\Sigma$ is the input alphabet, and $Q$, $X_{init}$, $X_{final}$ and $\Delta$ are as in the
definition of PDTs. Push and pop transitions are as before, but swap transitions
are simplified to the form $X \stackrel{x}{\mapsto} Y$, where $x \in \{\epsilon\} \cup \Sigma$. Computations are defined as
in the case of PDTs, except that configurations are now pairs $(\alpha, w)$ whereas they
were triples $(\alpha, w, v)$ in the case of PDTs. A probabilistic push-down automaton
(PPDA) is a pair $(\mathcal{A}, p_{\mathcal{A}})$ where $\mathcal{A}$ is a PDA and $p_{\mathcal{A}}$ is a probability function
subject to the same constraints as in the case of PPDTs. Since the definitions of
CPP and SPP for PDTs did not refer to output strings, these notions carry over
to PDAs in a straightforward way.

We define the size of a CFG as $\sum_{(A \to \alpha) \in R} |A\alpha|$, i.e., the total number of
occurrences of terminals and nonterminals in the set of rules. Similarly, we define
the size of a PDA as $\sum_{(\alpha \mapsto \beta) \in \Delta} |\alpha\beta| + \sum_{(X \stackrel{x}{\mapsto} Y) \in \Delta} |XxY|$, i.e., the total number of
occurrences of stack symbols and terminals in the set of transitions.

Let $\mathcal{A} = (\Sigma, Q, X_{init}, X_{final}, \Delta)$ be a PDA with both CPP and SPP. We will
now show that we can construct an equivalent CFG $\mathcal{G} = (\Sigma, Q, X_{init}, R)$ with
size linear in the size of $\mathcal{A}$. The rules of this grammar are the following.

• $X \to YZ$ for each transition $X \mapsto XY$, where $Z$ is the unique stack symbol
such that there is at least one transition $XY' \mapsto Z$ with $Y \leadsto Y'$;

• $X \to xY$ for each transition $X \stackrel{x}{\mapsto} Y$;

• $Y \to \epsilon$ for each stack symbol $Y$ such that there is at least one transition
$XY \mapsto Z$ or such that $Y = X_{final}$.

It is easy to see that there exists a bijection from complete computations of $\mathcal{A}$
to complete derivations of $\mathcal{G}$, preserving the recognized/derived strings. Apart
from an additional derivation step by rule $X_{final} \to \epsilon$, the complete derivations
also have the same length as the corresponding complete computations.

The above construction can straightforwardly be extended to probabilistic
PDAs (PPDAs). Let $(\mathcal{A}, p_{\mathcal{A}})$ be a PPDA with both CPP and SPP. Then we
construct $\mathcal{G}$ as above, and further define $p_{\mathcal{G}}$ such that $p_{\mathcal{G}}(\pi) = p_{\mathcal{A}}(\tau)$ for rules
$\pi = X \to YZ$ or $\pi = X \to xY$ that we construct out of transitions $\tau = X \mapsto
XY$ or $\tau = X \stackrel{x}{\mapsto} Y$, respectively, in the first two items above. We also define
$p_{\mathcal{G}}(Y \to \epsilon) = 1$ for rules $Y \to \epsilon$ obtained in the third item above. If $(\mathcal{A}, p_{\mathcal{A}})$ is
reduced, proper and consistent then so is $(\mathcal{G}, p_{\mathcal{G}})$.

This leads to the observation that parsing strategies with the CPP and the 
SPP as well as their probabilistic extensions can also be described as grammar 
transformations, as follows. A given (P)CFG is mapped to an equivalent (P)PDT 
by a (probabilistic) parsing strategy. By ignoring the output components of swap 
transitions we obtain a (P)PDA, which can be mapped to an equivalent (P)CFG 
as shown above. This observation gives rise to an extension with probabilities of 
the work on covers by |]^, 

It has been shown by [^] that there is an infinite family of languages with the 
following property. The sizes of the smallest CFGs generating those languages
are at least quadratically larger than the sizes of the smallest equivalent PDAs. 
Note that this increase in size cannot occur if PDAs satisfy both the CPP and 
the SPP, as we have shown above. 

It is always possible to transform a PDA with the CPP but without the SPP 
to an equivalent PDA with both CPP and SPP, by a construction that increases 
the size of the PDA considerably (at least quadratically, in the light of the above
construction and [^]). However, such transformations in general do not preserve 
parsing strategies and therefore are of minor interest to the issues discussed in 
this paper. 

The simple relationship between PDAs with both CPP and SPP on the one 
hand and CFGs on the other can be used to carry over algorithms originally 
designed for CFGs to PDAs or PDTs. One such application is the evaluation of 
the right-hand side of equation (|^ in the proof of Theorem ^ Both the numerator 
and the denominator involve potentially infinite sets of subcomputations, and 
therefore it is not immediately clear that the proof is constructive. However, 
there are published algorithms to compute, for a given PCFG (G', p_G') that is not
necessarily proper and a given nonterminal A, the expression Σ_{w ∈ Σ*} p_G'(A ⇒* w),

or rather, to approximate it with arbitrary precision; see [§, ^|. This can be used 
to compute e.g. Pgiv) in equation (||), as follows. 

The first step is to map the PDT to a CFG G' as shown above. We then
define a function p_G' that assigns probability 1 to all rules that we construct out
of push and pop transitions. We also let p_G' assign probability p_A(τ) to a rule
X → xY that we construct out of a scan transition τ = X ↦^{x,y} Y. It is easy to see
that, for any stack symbol X, we have S^gj,^ Pgiv) = S^gs* Pg'{X =>* w). This 
allows our problem on the computations of probabilities in the right-hand side 
of equation (||) to be reduced to a problem on PCFGs, which can be solved by 
existing algorithms as discussed above. 


6 Parsing strategies with SPP 


Many well-known parsing strategies with the CPP also have the SPP, such as 
top-down parsing, left-corner parsing [34] and PLR parsing [42], the first two
of which we will define explicitly here, whereas of the third we will merely present 
a sketch. A fourth strategy that we will discuss is a combination of left-corner 
and top-down parsing, with special computational properties. 

In order to simplify the presentation, we allow a new type of transition, without
increasing the power of PDTs, viz. a combined push/swap transition of the
form X ↦^{x,y} XY. Such a transition can be seen as short-hand for two transitions,
the first of the form X ↦ X Y_{x,y}, where Y_{x,y} is a new symbol not already in Q,
and the second of the form Y_{x,y} ↦^{x,y} Y.

The first strategy we discuss is top-down parsing. For a fixed CFG
G = (Σ, N, S, R), we define S_TD(G) = (A, f). Here A = (Σ, R, Q, [S → • σ],
[S → σ •], Δ), where Q = {[A → α • β] | (A → αβ) ∈ R}; these ‘dotted rules’
are well-known from [?]. The transitions in Δ are:


• [A → α • aβ] ↦^{a,ε} [A → αa • β] for each rule A → αaβ;

• [A → α • Bβ] ↦^{ε,π} [A → α • Bβ] [B → • γ] for each pair of rules A → αBβ
and π = B → γ;

• [A → α • Bβ] [B → γ •] ↦ [A → αB • β].
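
The three kinds of top-down transitions can be generated mechanically from the rules. The sketch below assumes an illustrative encoding that is not part of the paper: a rule is (lhs, rhs tuple), a dotted rule is (lhs, rhs, dot position), and the three returned lists hold scan, combined push/swap (predict) and pop transitions.

```python
def top_down_pdt(rules, terminals):
    """Enumerate the transitions of a top-down PDT for the given rules."""
    scans, predicts, pops = [], [], []
    for (lhs, rhs) in rules:
        for dot, sym in enumerate(rhs):
            here = (lhs, rhs, dot)
            after = (lhs, rhs, dot + 1)
            if sym in terminals:
                # [A -> alpha . a beta] reads a and moves the dot.
                scans.append((here, sym, after))
                continue
            for rule in rules:
                b_lhs, b_rhs = rule
                if b_lhs != sym:
                    continue
                # Push [B -> . gamma] on top and output the rule B -> gamma.
                predicts.append((here, rule, (b_lhs, b_rhs, 0)))
                # Pop the completed [B -> gamma .] and move the dot over B.
                pops.append((here, (b_lhs, b_rhs, len(b_rhs)), after))
    return scans, predicts, pops


# Hypothetical grammar: S -> A b, A -> a, A -> epsilon.
td_rules = [("S", ("A", "b")), ("A", ("a",)), ("A", ())]
td_scans, td_predicts, td_pops = top_down_pdt(td_rules, {"a", "b"})
```

For a probabilistic version, each predict transition would carry the probability of the rule it outputs, and scans and pops would carry probability 1.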

The function f is the identity function on strings over R. If seen as a function on
computations, then f is a bijection from complete computations of A to complete
derivations of G, as required by the definition of ‘parsing strategy’.

If G is reduced, then A clearly has the CPP. That it also has the SPP can be
argued as follows. Let us first remark that if [A → α • β] ⇝ X for some stack
symbols [A → α • β] and X, then X must be of the form [A → αγ • δ], for
some γ and δ such that γδ = β. Now, if there are three transitions X ↦ XY,
XY_1 ↦ Z_1 and XY_2 ↦ Z_2 such that Y ⇝ Y_1 and Y ⇝ Y_2, then X must be of the
form [A → α • Bβ] and Y of the form [B → • γ] (strictly speaking, [B → • γ]_{ε,π}),
Y_1 and Y_2 must both be [B → γ •], and Z_1 and Z_2 must both be [A → αB • β].
Hence the SPP is satisfied.

Since S_TD has both CPP and SPP, we may apply Theorem [?] to conclude
that S_TD can be extended to become a probabilistic parsing strategy. A direct
construction of a top-down PPDT from a PCFG (G, p_G) is obtained by extending
the above construction such that probability 1 is assigned to all transitions produced
by the first and third items, and probability p_G(π) is assigned to transitions
produced by the second item.

The second strategy we discuss is left-corner (LC) parsing [34]. For a fixed
CFG G = (Σ, N, S, R), we define the binary relation ∠ over Σ ∪ N by: X ∠ A if
and only if there is an α ∈ (Σ ∪ N)* such that (A → Xα) ∈ R, where X ∈ Σ ∪ N.
We define the binary relation ∠* to be the reflexive and transitive closure of ∠.
This implies that a ∠* a for all a ∈ Σ.
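
The left-corner relation and its reflexive and transitive closure can be computed by a simple fixed-point iteration. The sketch below uses an assumed encoding of rules as (lhs, rhs tuple) pairs; a pair (X, A) in the result means that X is related to A.

```python
def left_corner_closure(rules, symbols):
    """Reflexive and transitive closure of the left-corner relation."""
    # Direct relation: X is the first right-hand side symbol of a rule for A.
    direct = {(rhs[0], lhs) for lhs, rhs in rules if rhs}
    closure = {(x, x) for x in symbols} | direct
    changed = True
    while changed:
        changed = False
        for (x, y) in list(closure):
            for (y2, z) in direct:
                if y == y2 and (x, z) not in closure:
                    closure.add((x, z))
                    changed = True
    return closure


# Hypothetical grammar: S -> A b, A -> a.
lc_rules = [("S", ("A", "b")), ("A", ("a",))]
lc_star = left_corner_closure(lc_rules, {"S", "A", "a", "b"})
```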

We now define S_LC(G) = (A, f). Here A = (Σ, R ∪ {⊣}, Q, [S → • σ],
[S → σ •], Δ), where Q contains stack symbols of the form [A → α • β] where
(A → αβ) ∈ R such that α ≠ ε ∨ A = S, and stack symbols of the form [A → α •
Yβ; X] where (A → αYβ) ∈ R and X, Y ∈ Σ ∪ N such that α ≠ ε ∨ A = S and
X ∠* Y. The latter type of stack symbol indicates that left corner X of goal Y in
the right-hand side of rule A → αYβ has just been recognized. The transitions
in Δ are:

• [A → α • Yβ] ↦^{a,ε} [A → α • Yβ; a] for each rule A → αYβ and a ∈ Σ such
that α ≠ ε ∨ A = S and a ∠* Y;

• [A → α • Bβ] ↦^{ε,π} [A → α • Bβ; C] for each pair of rules A → αBβ and
π = C → ε such that α ≠ ε ∨ A = S and C ∠* B;

• [A → α • Bβ; X] ↦^{ε,π} [A → α • Bβ; X] [C → X • γ] for each pair of rules
A → αBβ and π = C → Xγ such that α ≠ ε ∨ A = S and C ∠* B;

• [A → α • Bβ; X] [C → Xγ •] ↦ [A → α • Bβ; C] for each pair of rules
A → αBβ and C → Xγ such that α ≠ ε ∨ A = S and C ∠* B;

• [A → α • Yβ; Y] ↦^{ε,⊣} [A → αY • β] for each rule A → αYβ such that α ≠
ε ∨ A = S.
f(v) = d
    where
        (d, ε) = f_LC(ε, v)

f_LC(d, π v_0) = (d'', v'')
    where
        l is such that π = A → X X_1 ··· X_l, or π = A → ε ∧ l = 0
        (d_1, v_1) = if X_1 ∈ Σ then (ε, v_0) else f_LC(ε, v_0)
        ···
        (d_l, v_l) = if X_l ∈ Σ then (ε, v_{l−1}) else f_LC(ε, v_{l−1})
        d' = π d d_1 ··· d_l
        (d'', v'') = if ⊣v' = v_l then (d', v') else f_LC(d', v_l)

Figure 2: Function f for S_LC.


The function f has to rearrange an output string to obtain a complete derivation.
To make this possible, the output alphabet contains the symbol ⊣ in addition to
rules from R. This symbol is used to mark the end of an upward path of nodes
in the parse tree each of which, except the last, is the left-most daughter node of
its mother node. As explained in [31], in the absence of such a symbol, it would
be impossible to uniquely identify output strings with derivations of the input.²

The function f for the strategy S_LC is defined by Figure 2. Function f is
defined in terms of function f_LC, which has two arguments. The first argument,
d, is either the empty string or a subderivation that has already been constructed.
The second argument is a suffix of the output string originally supplied as argument
to f. Function f_LC removes the first symbol π from the output string,
which will be a rule A → X X_1 ··· X_l or A → ε. In the former case, d must be
ε if X ∈ Σ and d must be a subderivation from nonterminal X otherwise. The
function is then called recursively zero or more times, once for each nonterminal
in X_1 ··· X_l, to obtain more subderivations d_i, 1 ≤ i ≤ l, each of which is obtained
by consuming a subsequent part of the output string. These subderivations are
combined into a larger subderivation d' = π d d_1 ··· d_l. Depending on the question
whether we encounter ⊣ as the immediately following symbol of the output string,
we return the derivation d' and the remainder v' of the output string, or call f_LC
recursively once more to obtain a larger subderivation.
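
The recursion just described can be transcribed into Python almost literally. The following is a sketch under assumptions: rules are encoded as (lhs, rhs tuple) pairs, derivations and output strings are Python lists, and the constant `END` is a stand-in introduced here for the end-of-path marker. The sample output string is written by hand to fit the recursion; it was not produced by running a transducer.

```python
END = "END"  # stand-in for the end-of-path marker in the output alphabet


def f_lc(d, v, terminals):
    """Consume one rule from v and build a subderivation around d."""
    rule, rest = v[0], v[1:]
    _lhs, rhs = rule
    subderivations = []
    # rhs[0] is the left corner X (absent for an epsilon rule); the symbols
    # after it are X_1 ... X_l, and each nonterminal needs a recursive call.
    for sym in rhs[1:]:
        if sym not in terminals:
            d_i, rest = f_lc([], rest, terminals)
            subderivations.extend(d_i)
    d_new = [rule] + d + subderivations
    if rest and rest[0] == END:
        return d_new, rest[1:]          # upward path ends here
    return f_lc(d_new, rest, terminals)  # continue upward


def f(v, terminals):
    """Map a complete output string to a derivation."""
    d, rest = f_lc([], v, terminals)
    if rest:
        raise ValueError("output string not fully consumed")
    return d


# Hypothetical rules: A -> a, C -> A B, B -> b.
r1, r2, r3 = ("A", ("a",)), ("C", ("A", "B")), ("B", ("b",))
derivation = f([r1, r2, r3, END, END], {"a", "b"})
```

Here the rearranged result lists C → A B first, followed by the subderivations for A and B, which is the left-most derivation order.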

It can be easily shown that this strategy has the CPP. Regarding the SPP, note
that if there are two transitions [A → α • Bβ; X] ↦^{ε,π} [A → α • Bβ; X] [C → X • γ]
and [A → α • Bβ; X] Y_1 ↦ Z_1 such that [C → X • γ] ⇝ Y_1, then Y_1 must be
[C → Xγ •] and Z_1 must be [A → α • Bβ; C], which means that Z_1 is uniquely
determined by the first transition.

² In [31, pp. 22-23] a context-free grammar is considered that consists of the set of rules
R = {S → aS, S → Sb, S → c}. It is shown that any left-corner push-down transducer using
only R as output alphabet would output at most one string for each input string, whereas there
may be several derivations of the input, as the grammar is ambiguous.

Since S_LC has both CPP and SPP, left-corner parsing can be extended to
become a probabilistic parsing strategy. A direct construction of probabilistic
left-corner parsers from PCFGs has been presented by [45].

Since at most two rules occur in each of the items above, the size of a (probabilistic)
left-corner parser is O(|G|^2), where |G| denotes the size of G. This is the
same complexity as that of the direct construction by [45]. This is in contrast
to a construction of ‘shift-reduce’ PPDAs out of PCFGs from [1], which were of
size O(|G|^5).³ The “conjecture that there is no concise translation of PCFGs into
shift-reduce PPDAs” from [1] is made less significant by the earlier construction
by [45] and our construction above. It must be noted however that the
‘shift-reduce’ model adhered to by [1] is more restrictive than the PDT models adhered
to by [45] and by us.

When we look at upper bounds on the sizes of PPDAs (or PPDTs) that
describe the same probability distributions as given PCFGs, and compare these
with the upper bounds for (non-probabilistic) PDAs (or PDTs) for given CFGs,
we can make the following observation. Theorem 3 states that parsing strategies
without the CPP cannot be extended to become probabilistic. Furthermore,
[^4| has shown that for certain fixed languages the smallest PDAs without the 
CPP are much smaller than the smallest PDAs with the CPP. It may therefore 
appear that probabilistic PDAs are in general larger than non-probabilistic ones. 
However, the automata studied by [^| pertain to very specific languages, and at 
this point there is little reason to believe that the demonstrated results for these 
languages carry over to any reasonable strategy for general CFGs. 

The third parsing strategy that we discuss is PLR parsing [42]. Since it is
very similar to LC parsing, we merely provide a sketch. The stack symbols for
PLR parsing are like those for LC parsing, except that the parts of rules following
the dot are omitted. Thus, instead of symbols of the form [A → α • β] and of
the form [A → α • β; X], a PLR parser manipulates stack symbols [A → α]
and [A → α; X], respectively. That β is omitted means that PLR parsers may
postpone commitment to one of two similar rules A → αβ and A → αβ' until
the point is reached where β and β' differ. In this sense PLR parsing is less
predictive than LC parsing, although it still satisfies the strong predictiveness


³ This construction consisted of a transformation to Chomsky normal form followed by a
transformation to Greibach normal form (GNF). Its worst-case time complexity, established
in personal communication with David McAllester, is reached for a family of CFGs
(G_n)_{n≥2}, defined by G_n = ({a_1, ..., a_n}, {A_1, ..., A_n}, A_1, R), where R contains
the rules A_i → A_{i+1}, for 1 ≤ i ≤ n − 1, A_n → A_1, and A_i → A_i A_i and A_i → a_i,
for 1 ≤ i ≤ n. After transformation to GNF, the
grammar contains n® rules of the form Ai^ /Ai^ ai^ Ai^ /Ai^ Ai^/Ai^, with 1 < ii, * 2 , L, * 4 , *5 < 
n. In ^ a more economical transformation to Greibach normal form is given; straightforward 
extension to probabilities leads to probabilistic parsers of the type considered by of size 


property, so that it can be extended to become probabilistic. 

There are two minor differences between the transitions of LC parsers and
those of PLR parsers. The first is the simplification of stack symbols as explained
above. The second is that for PLR, output of a rule is delayed until it is completely
recognized. The resulting output strings are right-most derivations in reverse,
which requires different functions f than in the case of LC parsing. Note that
right-most derivations can be effectively mapped to corresponding parse trees,
and parse trees can be effectively mapped to corresponding left-most derivations.
Hence the required functions f clearly exist.

The last strategy to be discussed in this section is a combination of left-corner 
and top-down parsing. It has the special property that, provided the fixed CFG is 
acyclic, the length of computations is bounded by a linear function on the length 
of the input, which means that the parser cannot ‘loop’ on any input. Note that if 
the grammar is not acyclic, computations of unbounded length cannot be avoided 
by any parsing strategy. From this perspective, this parsing strategy, which we 
will call ε-LC parsing, is optimal. It is based on [?], and a related idea for LR
parsing was described by |2^. The special termination properties of this strategy 
will be needed in Section 

We first define the binary relation ∠_ε over Σ ∪ N by: X ∠_ε A if and only if
there are α, β ∈ (Σ ∪ N)* such that (A → αXβ) ∈ R and α ⇒* ε. Relation ∠_ε
differs from the relation ∠ defined earlier in that epsilon-generating nonterminals
at the beginning of a rule may be ignored.
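
A small sketch shows how this relation can be computed: first find the epsilon-generating nonterminals by fixed-point iteration, then walk each right-hand side from the left for as long as the symbols passed are epsilon-generating. The rule encoding (lhs, rhs tuple) and the function names are assumptions made for this illustration.

```python
def nullable_nonterminals(rules):
    """Nonterminals that derive the empty string."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs not in nullable and all(s in nullable for s in rhs):
                nullable.add(lhs)
                changed = True
    return nullable


def eps_left_corner(rules):
    """Pairs (X, A) such that A -> alpha X beta with alpha deriving epsilon."""
    nullable = nullable_nonterminals(rules)
    relation = set()
    for lhs, rhs in rules:
        for sym in rhs:
            relation.add((sym, lhs))
            if sym not in nullable:
                break  # the prefix up to here no longer derives epsilon
    return relation


# Hypothetical grammar: S -> E A b, E -> epsilon, A -> a.
eps_rules = [("S", ("E", "A", "b")), ("E", ()), ("A", ("a",))]
eps_relation = eps_left_corner(eps_rules)
```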

The stack symbols are now of the form [A → α • β, μ • ν] or of the form
[A → α • Yβ, μ • ν; X]. Similar to the stack symbols for pure LC parsing, we have
α ≠ ε ∨ A = S and X ∠_ε* Y. Different is the additional dotted expression μ • ν,
which is such that μν is a string of epsilon-generating nonterminals, occurring
at the beginning of the right-hand side of a rule A → μναβ or A → μναYβ,
respectively. The string μν will be ignored in the part of the strategy that
behaves like left-corner parsing, where μ = ε. However, when the dot of the
first dotted expression is at the end, i.e., when we obtain a stack symbol of
the form [A → α •, • ν], then top-down parsing will be activated to retrieve
epsilon-generating subderivations for the nonterminals in ν, and the dot will
move through ν from left to right.⁴

We have X_init = [S → • σ, •] and X_final = [S → σ •, •], where for technical
reasons, and without loss of generality, we assume that σ does not contain any
epsilon-generating nonterminals. Next to the symbols from R and the symbol
⊣, the output alphabet also includes the set of integers {0, ..., l − 1}, where
l = |α| for a rule (A → α) ∈ R of maximal length; the purpose of such integers
will become clear below. For the definition of the set of transitions, we will be
less precise than for S_TD and S_LC, to prevent cluttering up the presentation with
details. We point out however that in order to produce a reduced PDT from a
reduced CFG, further side conditions are needed for all items below:


⁴ Although such subderivations can also be pre-compiled during construction of the PDT, we
refrain from doing so since this could lead to a PDT of exponential size. 


• [A → α • Yβ, • μ] ↦^{a,ε} [A → α • Yβ, • μ; a] for a ∈ Σ such that a ∠_ε* Y;

• [A → α • Bβ, • μ] ↦^{ε,π0} [A → α • Bβ, • μ; C] for π = C → ε such that C ∠_ε* B;

• [A → α • Bβ, • μ; X] ↦^{ε,πm} [A → α • Bβ, • μ; X] [C → X • γ, • μ'] for π =
C → μ'Xγ such that C ∠_ε* B and μ' ⇒* ε, where m = |μ'|;

• [A → α • Bβ, • μ; X] [C → Xγ •, μ' •] ↦ [A → α • Bβ, • μ; C];

• [A → α • Yβ, • μ; Y] ↦^{ε,⊣} [A → αY • β, • μ];

• [A → α •, μ • Bν] ↦^{ε,π} [A → α •, μ • Bν] [B → •, • μ'] for π = B → μ' such
that μ' ⇒* ε;

• [A → α •, μ • Bν] [B → •, μ' •] ↦ [A → α •, μB • ν].

The first five items are almost identical to the five items we presented for
S_LC, except that strings μ of epsilon-generating nonterminals at the beginning
of rules are ignored. The length m of a string μ is output just after the relevant
grammar rule is output, in the second and third items. This length m will be
needed to define function f below.

The last two items follow a top-down strategy, but only for epsilon-generating
rules. The produced transitions do what was deferred by the left-corner part of
the strategy: they construct subderivations for the epsilon-generating nonterminals
in strings μ.

The function f, which produces a complete derivation from an output string,
is defined through two auxiliary functions, viz. f_{ε-LC} for the left-corner part and
f_{ε-TD} for the top-down part, as shown in Figure 3.

The function f_{ε-LC} is similar to f_LC defined in Figure 2. The main difference
is that now subderivations deriving ε for the first m nonterminals in the right-hand
side of a rule are obtained by calls of the function f_{ε-TD}. For a suffix v of an
output string, f_{ε-TD}(v) yields a pair (π d_1 ··· d_l, v_l) such that v = π d_1 d_2 ··· d_l v_l. In
other words, f_{ε-TD} does nothing more than split its argument into two parts. The
length of the first part π d_1 ··· d_l depends on the length l of the right-hand side of
rule π and on the lengths of right-hand sides of rules that are visited recursively.
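
The splitting behaviour of the top-down auxiliary function is easy to see in code. The sketch below assumes that rules are encoded as (lhs, rhs tuple) pairs and that every right-hand side symbol of a rule handled here is an epsilon-generating nonterminal, as the text requires; the function name is chosen for this illustration.

```python
def f_eps_td(v):
    """Split v into a top-down subderivation and the remaining suffix."""
    rule, rest = v[0], v[1:]
    derivation = [rule]
    # One recursive call per nonterminal B_1 ... B_l on the right-hand side.
    for _nonterminal in rule[1]:
        d_i, rest = f_eps_td(rest)
        derivation.extend(d_i)
    return derivation, rest


# Hypothetical rules: B -> E E and E -> epsilon.
rb, re_ = ("B", ("E", "E")), ("E", ())
td_prefix, td_suffix = f_eps_td([rb, re_, re_, "tail"])
```

The returned prefix and suffix concatenate back to the input, so the function only decides where the subderivation ends.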

It can be easily seen that S_{ε-LC} has both CPP and SPP. The size of a produced
PDT is now O(|G|^3), rather than O(|G|^2) as in the case of S_LC.


7 Parsing strategies without SPP 


In this section we show that the absence of the strong predictiveness property
may mean that a parsing strategy with the CPP cannot be extended to become a
probabilistic parsing strategy. We first illustrate this for LR(0) parsing, formalized
as a parsing strategy S_LR, which has the CPP but not the SPP, as we will
see. We assume the reader is familiar with LR parsing; see [?].
f(v) = d
    where
        (d, ε) = f_{ε-LC}(ε, v)

f_{ε-LC}(d, π m v_0) = (d'', v'')
    where
        l is such that π = A → B_1 ··· B_m X X_1 ··· X_l, or π = A → ε ∧ l = 0
        (d_1, v_1) = if X_1 ∈ Σ then (ε, v_0) else f_{ε-LC}(ε, v_0)
        ···
        (d_l, v_l) = if X_l ∈ Σ then (ε, v_{l−1}) else f_{ε-LC}(ε, v_{l−1})
        (d'_1, v_{l+1}) = f_{ε-TD}(v_l)
        ···
        (d'_m, v_{l+m}) = f_{ε-TD}(v_{l+m−1})
        d' = π d'_1 ··· d'_m d d_1 ··· d_l
        (d'', v'') = if ⊣v' = v_{l+m} then (d', v') else f_{ε-LC}(d', v_{l+m})

f_{ε-TD}(v) = (π d_1 ··· d_l, v_l)
    where
        π v_0 = v
        l is such that π = A → B_1 ··· B_l
        (d_1, v_1) = f_{ε-TD}(v_0)
        ···
        (d_l, v_l) = f_{ε-TD}(v_{l−1})

Figure 3: Function f for S_{ε-LC}.
We take a PCFG (G, p_G) defined by:

    π_S  = S → A B,    p_G(π_S)  = 1
    π_A1 = A → a C,    p_G(π_A1) = [?]
    π_A2 = A → a D,    p_G(π_A2) = [?]
    π_B1 = B → b C,    p_G(π_B1) = [?]
    π_B2 = B → b D,    p_G(π_B2) = [?]
    π_C  = C → x c,    p_G(π_C)  = 1
    π_D  = D → x d,    p_G(π_D)  = 1


Note that this grammar generates a finite language. 

We will not present the entire LR automaton A, with S_LR(G) = (A, f) for
some f, but we merely mention two of its key transitions, which represent shift
actions over c and d:

    τ_c = {C → x • c, D → x • d} ↦^{c,ε} {C → x • c, D → x • d} {C → xc •}
    τ_d = {C → x • c, D → x • d} ↦^{d,ε} {C → x • c, D → x • d} {D → xd •}

(We denote LR states by their sets of kernel items, as usual.)

Take a probability function p_A such that (A, p_A) is a proper PPDT. It can be
easily seen that p_A must assign 1 to all transitions except τ_c and τ_d, since that
is the only pair of distinct transitions that can be applied for one and the same
top-of-stack symbol, viz. {C → x • c, D → x • d}. However,

    p_G(axcbxd) / p_G(axdbxc) = (p_G(π_A1) · p_G(π_B2)) / (p_G(π_A2) · p_G(π_B1)) ≠ 1,

whereas

    p_A(axcbxd) / p_A(axdbxc) = (p_A(τ_c) · p_A(τ_d)) / (p_A(τ_d) · p_A(τ_c)) = 1.

This shows that there is no p_A such that (A, p_A) assigns the same probabilities
to strings over Σ as (G, p_G). It follows that the LR
strategy cannot be extended to become a probabilistic parsing strategy.
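
The argument can be checked numerically. In the sketch below the four rule probabilities are illustrative assumptions chosen for this example (any proper choice with a ratio different from 1 would do); they are not values taken from the text. The function `lr_ratio` models the only freedom the LR automaton has, namely the probability of the shift over c, with the shift over d receiving the remainder.

```python
from fractions import Fraction

# Assumed rule probabilities, for illustration only.
p = {
    "A1": Fraction(1, 3),  # A -> a C
    "A2": Fraction(2, 3),  # A -> a D
    "B1": Fraction(2, 3),  # B -> b C
    "B2": Fraction(1, 3),  # B -> b D
}

# axcbxd uses A -> a C and B -> b D; axdbxc uses A -> a D and B -> b C.
pcfg_ratio = (p["A1"] * p["B2"]) / (p["A2"] * p["B1"])


def lr_ratio(p_tau_c):
    """Ratio of the two string probabilities in a proper LR PPDT."""
    p_tau_d = 1 - p_tau_c
    # Each string applies the shift over c once and the shift over d once,
    # only in a different order, so the ratio cannot depend on the choice.
    return (p_tau_c * p_tau_d) / (p_tau_d * p_tau_c)
```

Whatever value strictly between 0 and 1 is passed to `lr_ratio`, the result is 1, so the automaton cannot reproduce a PCFG ratio different from 1.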

Note that for G as above, p_G(π_A1) and p_G(π_B1) can be freely chosen, and
this choice determines the other values of p_G, so we have two free parameters.
For A however, there is only one free parameter in the choice of p_A. This is
in conflict with an underlying assumption of existing work on probabilistic LR 
parsing, by e.g. Q and ||l^, viz. that LR parsers would allow more fine-grained 
probability distributions than CFGs. However, for some practical grammars from 
the area of natural language processing, | ^ ] has shown that LR parsers do allow 
more accurate probability distributions than the CFGs from which they were 
constructed, if probability functions are estimated from corpora. 

By way of Theorem Q, it follows indirectly from the above that LR parsing 
lacks the SPP. For the somewhat simpler ELR parsing strategy, to be discussed 
next, we will give a direct explanation of why it lacks the SPP. A direct expla¬ 
nation for LR parsing is much more involved and therefore is not reported here, 
although the argument is essentially of the same nature as the one we discuss for 
ELR parsing. 

The ELR parsing strategy is not as well-known as LR parsing. It was originally
formulated as a parsing strategy for extended CFGs [?], but its restriction
to normal CFGs is interesting in its own right, as argued by [27]. ELR
parsing for CFGs is also related to the tabular algorithm from [?].

Concerning the representation of right-hand sides of rules, stack symbols for
ELR parsing are similar to those for PLR parsing: only the part of a right-hand
side is represented that consists of the grammar symbols that have been
processed. Different from LC and PLR parsing is however that a stack symbol
for ELR parsing contains a set consisting of one or more nonterminals from the
left-hand sides of pairwise similar rules, rather than a single such nonterminal.
This allows the commitment to certain rules, and in particular to their left-hand
sides, to be postponed even longer than for LC and PLR parsing.

Thus, for a given CFG G = (Σ, N, S, R), we construct a pair S_ELR(G) =
(A, f). Here A = (Σ, R, Q, [{S} → ε], [{S} → σ], Δ), where Q is a subset of
{[r ^ a] I r c A^aVH G r3p[{A af3) e R]} U {[r ^ a] B] \ r Q N A 
VH G r3p[{A ^ a/3) G R A R G N]}. 

We provide simultaneous inductive definitions of Q and Δ:

• [{S} → ε] ∈ Q;

• For [Γ → α] ∈ Q, rule A → αYβ and a ∈ Σ such that A ∈ Γ and a ∠* Y,
let [Γ → α; a] ∈ Q and [Γ → α] ↦^{a,ε} [Γ → α; a] ∈ Δ;

• For [Γ → α] ∈ Q, rules A → αBβ and π = C → ε such that A ∈ Γ and
C ∠* B, let [Γ → α; C] ∈ Q and [Γ → α] ↦^{ε,π} [Γ → α; C] ∈ Δ;

• For [Γ_1 → α; X] ∈ Q and Γ_2 = {C | ∃(A → αBβ) ∈ R[A ∈ Γ_1 ∧ C → Xγ ∧
C ∠* B]} ≠ ∅, let [Γ_2 → X] ∈ Q and [Γ_1 → α; X] ↦ [Γ_1 → α; X] [Γ_2 → X] ∈ Δ;

• For [Γ_1 → α; X], [Γ_2 → Xγ] ∈ Q, rules A → αBβ and π = C → Xγ
such that A ∈ Γ_1, C ∈ Γ_2 and C ∠* B, let [Γ_1 → α; C] ∈ Q and
[Γ_1 → α; X] [Γ_2 → Xγ] ↦^{ε,π} [Γ_1 → α; C] ∈ Δ;

• For [Γ_1 → α; Y] ∈ Q and Γ_2 = {A ∈ Γ_1 | ∃β[(A → αYβ) ∈ R]} ≠ ∅, let
[Γ_2 → αY] ∈ Q and [Γ_1 → α; Y] ↦ [Γ_2 → αY] ∈ Δ.

Note that the last five items are very similar to the five items for LC parsing.
In the second last item, we have assumed the availability of combined pop/swap
transitions of the form XY ↦^{x,y} Z. Such a transition can be seen as short-hand
for two transitions, the first of the form XY ↦ Z_{x,y}, where Z_{x,y} is a new symbol
not already in Q, and the second of the form Z_{x,y} ↦^{x,y} Z.

The function f is defined as in the case of PLR parsing, and turns a complete
right-most derivation in reverse into a complete derivation.

ELR parsing has the CPP but, like LR parsing, it lacks the SPP. The problem
is caused by transitions of the form [Γ_1 → α; X] [Γ_2 → Xγ] ↦ [Γ_1 → α; C].
Intuitively, a subcomputation that recognizes γ, directly after recognition of X,
only commits to a choice of the left-hand side nonterminal C from Γ_2 after γ has
been completely recognized, and this choice is communicated to lower areas of
the stack through this pop transition.

          [{S} → ε] ↦^{a,ε} [{S} → ε; a]
          [{S} → ε; a] ↦ [{S} → ε; a] [{A} → a]
          [{A} → a] ↦^{x,ε} [{A} → a; x]
          [{A} → a; x] ↦ [{A} → a; x] [{C, D} → x]
    τ_c = [{C, D} → x] ↦^{c,ε} [{C, D} → x; c]
    τ_d = [{C, D} → x] ↦^{d,ε} [{C, D} → x; d]
          [{C, D} → x; c] ↦ [{C} → xc]
          [{A} → a; x] [{C} → xc] ↦^{ε,π_C} [{A} → a; C]
          [{A} → a; C] ↦ [{A} → aC]
          [{S} → ε; a] [{A} → aC] ↦^{ε,π_A1} [{S} → ε; A]
          [{C, D} → x; d] ↦ [{D} → xd]
          [{A} → a; x] [{D} → xd] ↦^{ε,π_D} [{A} → a; D]
          [{A} → a; D] ↦ [{A} → aD]
          [{S} → ε; a] [{A} → aD] ↦^{ε,π_A2} [{S} → ε; A]
          [{S} → ε; A] ↦ [{S} → A]
          [{S} → A] ↦^{b,ε} [{S} → A; b]
          [{S} → A; b] ↦ [{S} → A; b] [{B} → b]
          [{B} → b] ↦^{x,ε} [{B} → b; x]
          [{B} → b; x] ↦ [{B} → b; x] [{C, D} → x]
          [{B} → b; x] [{C} → xc] ↦^{ε,π_C} [{B} → b; C]
          [{B} → b; C] ↦ [{B} → bC]
          [{S} → A; b] [{B} → bC] ↦^{ε,π_B1} [{S} → A; B]
          [{B} → b; x] [{D} → xd] ↦^{ε,π_D} [{B} → b; D]
          [{B} → b; D] ↦ [{B} → bD]
          [{S} → A; b] [{B} → bD] ↦^{ε,π_B2} [{S} → A; B]
          [{S} → A; B] ↦ [{S} → AB]

Figure 4: Transitions for ELR parsing strategy.

That ELR parsing indeed cannot be extended to a probabilistic parsing
strategy can be shown by considering the same CFG as above. From the set of
transitions, shown in Figure 4, we restrict our attention to the following two:

    τ_c = [{C, D} → x] ↦^{c,ε} [{C, D} → x; c]
    τ_d = [{C, D} → x] ↦^{d,ε} [{C, D} → x; d]

This is the only pair of transitions that can be applied for one and the same
top-of-stack. The rest of the proof is identical to that in the case of LR parsing.

Problems with the extension of ELR parsing to become a probabilistic parsing
strategy have been pointed out before by [46], who furthermore proposed an alternative
type of probabilistic push-down automaton that is capable of computing
multiple probabilities for each subderivation. However, since a transition of such
an automaton may perform an unbounded number of elementary computations 
on probabilities, we feel this automaton model cannot realistically express the 
behaviour of probabilistic parsers, and therefore it will not be considered further 
here. 


8 Extension in the wide sense 


The main result from the previous section is that, in general, there is no construction
of probabilistic LR parsers from PCFGs such that, firstly, a probabilistic LR
parser has the same set of transitions as the LR parser that would be constructed 
from the CFG in the non-probabilistic case and, secondly, the probabilistic LR 
parser has the same probability distribution as the given PCFG. 


There is a construction proposed by [?] that operates under different
assumptions. In particular, a probabilistic LR parser constructed from a certain
PCFG may possess several ‘copies’ of one and the same LR state from the (non- 
probabilistic) LR parser constructed from the CFG, each annotated with some 
additional information to distinguish it from other copies of the same LR state. 
Each such copy behaves as the corresponding LR state from the LR parser if we 
neglect probabilities. Transitions may however obtain different probabilities if 
they operate on different copies of identical LR states, based on the additional 
information attached to the LR states. 

By this construction, there are many PCFGs for which one may obtain a 
probabilistic LR parser that describes the same probability distribution. This 
even holds for the PCFG we discussed in the previous section, although we have 
shown that a probabilistic LR parser without an extended LR state set could not 
describe the same probability distribution. A serious problem with this approach 
is however that the required number of copies of each LR state is potentially 
infinite. 

In this section we formulate these observations in terms of general parsing 
strategies and a wider notion of extension to probabilistic parsing strategies. We 
also show that the above-mentioned problem with infinite numbers of states is 
inherent in LR parsing, rather than due to the particular construction of LR 
parsers from PCFGs by [|^, ^] . 

We first introduce some auxiliary notation and terminology. Let A and A' be
two PDTs and let g be a function mapping the stack symbols of A' to the stack
symbols of A. If τ is a transition of the form X ↦ XY, YX ↦ Z or X ↦^{x,y} Y
from A', then we let g(τ) denote a transition of the form g(X) ↦ g(X)g(Y),
g(Y)g(X) ↦ g(Z) or g(X) ↦^{x,y} g(Y), respectively. This effectively extends g to
a function from transitions to transitions. Note that a transition g(τ) may, but
need not be a transition from A. In the same vein, we extend g to a function
from computations of A' to sequences of transitions (which may, but need not be
computations of A), by applying g element-wise as a function on transitions.
For PDTs A = (Σ_1, Σ_2, Q, X_init, X_final, Δ) and A' = (Σ'_1, Σ'_2, Q', X'_init,
X'_final, Δ'), we say A' is an expansion of A if Σ'_1 = Σ_1, Σ'_2 = Σ_2 and there is a
function g such that:

• g is a surjective function from Q' to Q.

• Extended to transitions, g is a surjective function from Δ' to Δ.

• Extended to computations, g is a bijective function from the set of computations
of A' to the set of computations of A.

In other words, for each stack symbol from Q, Q' may contain one or more corresponding
stack symbols. The language that is accepted and the output strings
that are produced for given input strings remain the same however. Furthermore,
that g is a bijection on computations implies that the behaviour of the two automata
is identical in terms of e.g. the length of computations and the amount
of nondeterminism encountered within those computations.

To illustrate these definitions, assume we have an arbitrary PDT A. We
construct a second PDT A' that is an expansion of A. It has the same input and
output alphabets, and for each stack symbol X from A, A' has two stack symbols
(X, 0) and (X, 1). A second component 0 signifies that the distance of the stack
symbol to the bottom of the stack is even, and 1 that it is odd. Naturally, if X_init
and X_final are the initial and final stack symbols of A, we choose the initial and
final stack symbols of A' to be (X_init, 0) and (X_final, 0), as they have distance 0
to the bottom of the stack. For each transition of the form X ↦ XY, YX ↦ Z
or X ↦^{x,y} Y from A, we let A' have the transitions (X, i) ↦ (X, i)(Y, 1 − i),
(Y, i)(X, 1 − i) ↦ (Z, i) or (X, i) ↦^{x,y} (Y, i), respectively, for both i = 0 and i = 1.
Obviously, the function g mapping stack symbols from A' to stack symbols from
A is given by g((X, i)) = X for all X and i ∈ {0, 1}.
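
The parity construction is small enough to write out. The sketch below assumes an illustrative tuple encoding of transitions (pairs for pushes, triples for pops, 4-tuples for swaps with input and output labels); the function names are chosen here and are not from the paper.

```python
def parity_expansion(pushes, pops, swaps):
    """Duplicate every transition for stack-height parity 0 and 1."""
    new_pushes, new_pops, new_swaps = set(), set(), set()
    for i in (0, 1):
        for (x, y) in pushes:            # X -> X Y
            new_pushes.add(((x, i), (y, 1 - i)))
        for (y, x, z) in pops:           # Y X -> Z
            new_pops.add(((y, i), (x, 1 - i), (z, i)))
        for (x, a, b, y) in swaps:       # X -> Y reading a, writing b
            new_swaps.add(((x, i), a, b, (y, i)))
    return new_pushes, new_pops, new_swaps


def g(symbol):
    """Map an expanded stack symbol (X, i) back to X."""
    return symbol[0]


# Hypothetical PDT fragment used only to exercise the construction.
orig_pushes = {("X", "Y")}
orig_pops = {("X", "Y", "Z")}
orig_swaps = {("X", "a", "", "Y")}
exp_pushes, exp_pops, exp_swaps = parity_expansion(
    orig_pushes, orig_pops, orig_swaps)
```

Applying g component-wise to the expanded transitions gives back exactly the original transition sets, which is the surjectivity condition on transitions.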

We now come to the central definition of this section. We say that probabilistic
parsing strategy S' is an extension in the wide sense of parsing strategy S if for
each reduced CFG G and probability function p_G we have S(G) = (A, f) if and
only if S'(G, p_G) = (A', p_A', f) for some A' that is an expansion of A and some
p_A'. This definition allows more probabilistic parsing strategies S' to be related
to a given strategy S than the definition of extension from Section [?].

LR parsing however, which we know cannot be extended to a probabilistic
strategy in the narrow sense from Section [?], can neither be extended in the wide
sense to a probabilistic parsing strategy. To prove this, consider the following
PCFG (G, p_G), taken from [?] with minor modifications:


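$$
\begin{array}{lcll}
\pi_S     & = & S \to A,  & p_{\mathcal{G}}(\pi_S) = 1 \\
\pi_{A_1} & = & A \to B,  & p_{\mathcal{G}}(\pi_{A_1}) = \\
\pi_{A_2} & = & A \to C,  & p_{\mathcal{G}}(\pi_{A_2}) = \\
\pi_{B_1} & = & B \to aB, & p_{\mathcal{G}}(\pi_{B_1}) = \\
\pi_{B_2} & = & B \to b,  & p_{\mathcal{G}}(\pi_{B_2}) = \\
\pi_{C_1} & = & C \to aC, & p_{\mathcal{G}}(\pi_{C_1}) = \\
\pi_{C_2} & = & C \to c,  & p_{\mathcal{G}}(\pi_{C_2}) =
\end{array}
$$
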
The CFG $\mathcal{G}$ generates strings of the form $a^n b$ and $a^n c$ for any $n \geq 0$. Observe




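Let $\mathcal{A}$ be such that $S_{LR}(\mathcal{G}) = (\mathcal{A}, f)$ and consider input strings of the form
$a^n b$ and $a^n c$, $n \geq 1$. After scanning the first $n$ symbols, $\mathcal{A}$ reaches a configuration
where the top-of-stack $X$ is given by the set of (kernel) items:

$$X = \{B \to a \bullet B,\; C \to a \bullet C\}$$

There are three applicable transitions, representing shift actions over $a$, $b$ and
$c$, given by:

$$
\begin{array}{lcl}
\tau_a & = & X \stackrel{a}{\mapsto} X \; X \\
\tau_b & = & X \stackrel{b}{\mapsto} X \; \{B \to b \,\bullet\} \\
\tau_c & = & X \stackrel{c}{\mapsto} X \; \{C \to c \,\bullet\}
\end{array}
$$

After reading $b$ or $c$, the remaining transitions are fully deterministic.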

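For a PDT $\mathcal{A}'$ that is an expansion of $\mathcal{A}$, we may have different stack symbols
that are all mapped to $X$ by function $g$. These stack symbols can be referred to
as $X_n$, which occur as top-of-stack after scanning the first $n$ symbols of $a^n b$ or
$a^n c$, $n \geq 1$. We refer to the applicable transitions with top-of-stack $X_n$ as:

$$
\begin{array}{lcl}
\tau_{a,n} & = & X_n \stackrel{a}{\mapsto} X_n \; X_{n+1} \\
\tau_{b,n} & = & X_n \stackrel{b}{\mapsto} X_n \; \{B \to b \,\bullet\}_n \\
\tau_{c,n} & = & X_n \stackrel{c}{\mapsto} X_n \; \{C \to c \,\bullet\}_n
\end{array}
$$

for certain stack symbols $\{B \to b \,\bullet\}_n$ and $\{C \to c \,\bullet\}_n$ that $g$ maps to $\{B \to b \,\bullet\}$
and $\{C \to c \,\bullet\}$, respectively.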

Now let us assume we have a probability function $p_{\mathcal{A}'}$ such that $(\mathcal{A}', p_{\mathcal{A}'})$ is a
PPDT. Since the application of either $\tau_{b,n}$ or $\tau_{c,n}$ is the only nondeterministic step
that distinguishes recognition of $a^n b$ from recognition of $a^n c$, $n \geq 1$, it follows
that if $(\mathcal{A}', p_{\mathcal{A}'})$ assigns the same probabilities to strings over
alphabet $\{a, b, c\}$ as $(\mathcal{G}, p_{\mathcal{G}})$, then the ratio $p_{\mathcal{A}'}(\tau_{b,n}) / p_{\mathcal{A}'}(\tau_{c,n})$ must be
equal to $p_{\mathcal{G}}(a^n b) / p_{\mathcal{G}}(a^n c)$ for each $n \geq 1$. Since this ratio is a different value
for each $n$ however, this would require $\mathcal{A}'$ to possess infinitely many stack
symbols, which is in conflict with the definition of push-down transducers.

This shows that no probability function $p_{\mathcal{A}'}$ exists for any expansion $\mathcal{A}'$ of
$\mathcal{A}$ such that $(\mathcal{A}', p_{\mathcal{A}'})$ assigns the same probabilities to strings over the alphabet
as $(\mathcal{G}, p_{\mathcal{G}})$, and therefore LR parsing cannot be extended in the wide sense to
become a probabilistic parsing strategy. With only minor changes to the proof,
the same can be shown for ELR parsing.
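
The core of the argument is that the ratio of the probabilities of $a^n b$ and $a^n c$ depends on $n$. The sketch below makes this concrete under two loudly stated assumptions: the rules have the shapes A → B | C, B → aB | b, C → aC | c, and the rule probabilities are hypothetical values chosen for this illustration, not the values used in the paper.

```python
from fractions import Fraction

# HYPOTHETICAL rule probabilities, chosen only for illustration.
# Assumed rule shapes: A -> B | C,  B -> aB | b,  C -> aC | c.
p = {
    "A1": Fraction(1, 2), "A2": Fraction(1, 2),   # A -> B, A -> C
    "B1": Fraction(2, 3), "B2": Fraction(1, 3),   # B -> aB, B -> b
    "C1": Fraction(1, 3), "C2": Fraction(2, 3),   # C -> aC, C -> c
}


def prob_anb(n):
    """Probability of a^n b: one use of A -> B, n uses of B -> aB, one of B -> b."""
    return p["A1"] * p["B1"] ** n * p["B2"]


def prob_anc(n):
    """Probability of a^n c: one use of A -> C, n uses of C -> aC, one of C -> c."""
    return p["A2"] * p["C1"] ** n * p["C2"]


# The ratio that tau_{b,n} and tau_{c,n} would have to reproduce, for n = 1..5.
ratios = [prob_anb(n) / prob_anc(n) for n in range(1, 6)]
```

With these assumed values the ratio doubles with every additional `a`, so each $X_n$ would need its own pair of transition probabilities, which a finite set of stack symbols cannot provide.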


9 Prefix probabilities 

In this section we show that the behaviour of PPDTs on input can be simulated by
dynamic programming. We also show how dynamic programming can be used for
computing prefix probabilities. Prefix probabilities have important applications,
e.g. in the area of speech recognition.

Our algorithm is a minor extension of an application of dynamic programming 
developed for non-probabilistic PDTs by |^, ||], and the treatment of probabilities 
is derived from [44].

Assume a fixed PPDT $(\mathcal{A}, p_{\mathcal{A}})$ and a fixed input string $a_1 \cdots a_n$. Consider
a computation of the form $c_1 \tau c_2$, where $(X_{\mathit{init}}, a_1 \cdots a_i, \epsilon) \stackrel{c_1}{\vdash^*} (\alpha X, \epsilon, v_1)$, $\tau$ is of
the form $X \mapsto XY'$, and $(Y', a_{i+1} \cdots a_j, \epsilon) \stackrel{c_2}{\vdash^*} (Y, \epsilon, v_2)$, for some stack symbols
$X$, $Y'$, $Y$, some input positions $i$ and $j$ ($0 \leq i \leq j \leq n$), and some output strings $v_1$
and $v_2$. In words, the computation obtains top-of-stack $X$ after scanning of $a_i$ but
before scanning of $a_{i+1}$, then applies a push transition, and then possibly further
push, scan and pop transitions, which leads to $Y$ on top of $X$ after scanning of
$a_j$ but before scanning of $a_{j+1}$.

We now abstract away from some details of such a computation by just recording
$X$, $Y$, $i$, $j$ and its probability $p_1 = p_{\mathcal{A}}(c_1 \tau c_2)$. The probability $p_1$ is related
to what is commonly called a forward probability, as it expresses the probability
of the computation from the beginning onward.^ The existence of the above
computation is represented by an object that we will call a table item, written as
$p_1 : \mathit{forward}(X, Y, i, j)$.

Similarly, consider a subcomputation of the form $\tau c_2$, where as before $\tau$ is of
the form $X \mapsto XY'$, and $(Y', a_{i+1} \cdots a_j, \epsilon) \stackrel{c_2}{\vdash^*} (Y, \epsilon, v_2)$, for some stack symbols
$X$, $Y'$, $Y$, some input positions $i$ and $j$ ($0 \leq i \leq j \leq n$), and some output string
$v_2$. We express the existence of such a subcomputation by a different kind of table
item, written as $p_2 : \mathit{inner}(X, Y, i, j)$, where $p_2 = p_{\mathcal{A}}(\tau c_2)$. Here, $p_2$ is related to
what is commonly called an inner probability, as it expresses only the probability
internally in a subcomputation.^

For technical reasons, we also need to consider computations $c$ where
$(X_{\mathit{init}}, a_1 \cdots a_j, \epsilon) \stackrel{c}{\vdash^*} (Y, \epsilon, v)$, for some $Y$, $j$ and $v$. These are represented by
table items $p_1 : \mathit{forward}(\bot, Y, 0, j)$, where $p_1 = p_{\mathcal{A}}(c)$. The symbol $\bot$ can be seen
as an imaginary stack symbol that is located below the actual bottom-of-stack
element.

All table items of the above forms, and only those table items, can be derived
by the deduction system in Figure 5. Deduction systems for defining parsing
algorithms have been described before by |^|; see also |^, for a very similar 
framework. A dynamic programming algorithm for such a deduction system 
incrementally fills a parse table with table items, given a grammar and input. 
During execution of the algorithm, items that are already in the table are matched 
against antecedents of inference rules. If a combination of items matches all antecedents
of an inference rule, then the item that matches the consequent of that inference
rule is added to the table. This process ends when no more new items can be
added to the table.

^Forward probability as defined by [^] refers to the sum of the probabilities of all computations
from the beginning onward that lead to a certain rule occurrence, whereas here we consider
only one computation at a time. We will turn to forward probabilities later in this section.

^We will turn to actual inner probabilities later in this section.

The item in the consequent of inference rule (20) represents the fact that at
the beginning of any computation, $X_{\mathit{init}}$ lies on top of imaginary stack element $\bot$,
no input has as yet been read, and the product of probabilities of all transitions
used in the represented computation is 1, since no transitions have been used yet.

Inference rule (21) derives a table item from an existing table item, if the
second stack symbol of that existing item indicates that a push transition can
be applied. Naturally, the probability in the new item is the product of the
probability in the old item and the probability of the applied transition. Inference
rule (22) is very similar.

Two subcomputations are combined through a pop transition by inference
rule (23), the intuition of which can be explained as follows. If $W$ occurs as
top-of-stack at position $i$ and reading the input up to $j$ results in $Y$ on top of
$W$, and if subsequently reading the input from $j$ to $k$ results in $X$ on top of
$Y$ and $YX$ may be replaced by $Z$ by a pop transition, then reading the input
from $i$ to $k$ results in $Z$ on top of $W$. The probability of the newly derived
subcomputation is the product of three probabilities. The first is the probability
of that subcomputation up to the point where $Y$ is top-of-stack, which is given
by $p_1$; the second is the probability from this point onward, up to the point where
$X$ is top-of-stack, which is given by $p_2$; the third is the probability of the pop
transition. The second of these probabilities, $p_2$, is defined by the inference rules
for ‘inner’ items to be discussed next.

Inference rule (24) starts the investigation of a new subcomputation that
begins with a push transition. This rule does not have any antecedents, but we
may add an item $p_1 : \mathit{forward}(Z, X, i, j)$ as antecedent, since the resulting ‘inner’
items can only be useful for the computation of ‘forward’ items if at least one
item of the form $p_1 : \mathit{forward}(Z, X, i, j)$ exists. We will not do so however, since
this would complicate the theoretical analysis.

The next two rules, (25) and (26), are almost identical to (22) and (23).

It is not difficult to see that for each complete computation of the form
$(X_{\mathit{init}}, a_1 \cdots a_n, \epsilon) \stackrel{c}{\vdash^*} (X_{\mathit{final}}, \epsilon, v)$, for some output string $v$, there is precisely one
derivation by the deduction system of some table item $p_1 : \mathit{forward}(\bot, X_{\mathit{final}}, 0, n)$,
where $p_1 = p_{\mathcal{A}}(c)$. Conversely, for each derivation of such a table item, there is a
unique corresponding computation. Computations and derivations can be easily
related to each other by looking at the transitions in the side conditions of the
inference rules.

It follows that if we take the sum of $p_1$ over all derivations of items $p_1 :
\mathit{forward}(\bot, X_{\mathit{final}}, 0, n)$, then we obtain the probability assigned by $\mathcal{A}$ to the input
$w = a_1 \cdots a_n$.
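
The derivation-by-derivation reading of the deduction system can be sketched as follows. Everything concrete here is an assumption for illustration: the toy PPDT (which accepts $a^n b^n$, $n \geq 1$, with probability $0.5^n$), the dictionary encodings, the use of `"_|_"` for the imaginary bottom symbol, and the choice to ignore output strings. The sketch enumerates each derivation separately, so it only terminates when the number of derivations is finite.

```python
# Toy PPDT, assumed for illustration only; output strings are ignored.
PUSH = {("I", "A"): 1.0, ("P", "A"): 1.0}                # X -> X Y
SCAN = {("A", "a", "A1"): 1.0, ("A1", "", "P"): 0.5,     # X -(x)-> Y, "" = epsilon
        ("A1", "", "Q"): 0.5, ("Q", "b", "D"): 1.0, ("C", "b", "D"): 1.0}
POP = {("P", "D", "C"): 1.0, ("I", "D", "F"): 1.0}       # Y X -> Z
X_INIT, X_FINAL, BOTTOM = "I", "F", "_|_"


def pop(left, right):
    """Pop rules (23) and (26): combine left item with an 'inner' right item."""
    kind, w_sym, y, i, j, p1 = left
    kind_r, y2, x, j2, k, p2 = right
    if kind_r != "inner" or y2 != y or j2 != j:
        return []
    return [(kind, w_sym, z, i, k, p1 * p2 * q)
            for (yy, xx, z), q in POP.items() if (yy, xx) == (y, x)]


def derive(w):
    """Return one tuple (kind, X, Y, i, j, prob) per derivation of a table item."""
    n = len(w)
    agenda = [("forward", BOTTOM, X_INIT, 0, 0, 1.0)]                 # rule (20)
    agenda += [("inner", x, y, j, j, q)                               # rule (24)
               for (x, y), q in PUSH.items() for j in range(n + 1)]
    done = []
    while agenda:
        e = agenda.pop()
        kind, z, x, i, j, p = e
        if kind == "forward":                                         # rule (21)
            agenda += [("forward", x, y, j, j, p * q)
                       for (x2, y), q in PUSH.items() if x2 == x]
        for (x2, sym, y), q in SCAN.items():                          # rules (22), (25)
            if x2 != x:
                continue
            if sym == "":
                agenda.append((kind, z, y, i, j, p * q))
            elif j < n and w[j] == sym:
                agenda.append((kind, z, y, i, j + 1, p * q))
        # Pair e with every earlier derivation in both roles, and with itself once.
        for f in done + [e]:
            agenda += pop(e, f)
        for f in done:
            agenda += pop(f, e)
        done.append(e)
    return done


def string_probability(w):
    """Sum p_1 over all derivations of forward(BOTTOM, X_FINAL, 0, n)."""
    goal = ("forward", BOTTOM, X_FINAL, 0, len(w))
    return sum(e[5] for e in derive(w) if e[:5] == goal)
```

For this toy automaton, `string_probability("ab")` evaluates to `0.5` and `string_probability("aabb")` to `0.25`, while a string outside the language, such as `"aab"`, yields no goal item and hence probability 0.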

Now assume that $\mathcal{A}$ is proper and consistent. For a given string $w' \in \Sigma^*$,
where $\Sigma$ is the input alphabet, we define the prefix probability of $w'$ to be

$$\sum_{w'' \in \Sigma^*} p_{\mathcal{A}}(w'w'')$$

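Initialization:

$$\frac{}{1 : \mathit{forward}(\bot, X_{\mathit{init}}, 0, 0)} \qquad (20)$$

Push (forward):

$$\frac{p_1 : \mathit{forward}(Z, X, i, j)}{p_1 \cdot p_{\mathcal{A}}(\tau) : \mathit{forward}(X, Y, j, j)} \quad \tau = X \mapsto XY \qquad (21)$$

Scan (forward):

$$\frac{p_1 : \mathit{forward}(Z, X, i, j)}{p_1 \cdot p_{\mathcal{A}}(\tau) : \mathit{forward}(Z, Y, i, j')} \quad \left\{ \begin{array}{l} \tau = X \stackrel{x,y}{\mapsto} Y \\ (x = \epsilon \wedge j' = j) \;\vee \\ (x = a_{j+1} \wedge j' = j + 1) \end{array} \right. \qquad (22)$$

Pop (forward):

$$\frac{\begin{array}{l} p_1 : \mathit{forward}(W, Y, i, j) \\ p_2 : \mathit{inner}(Y, X, j, k) \end{array}}{p_1 \cdot p_2 \cdot p_{\mathcal{A}}(\tau) : \mathit{forward}(W, Z, i, k)} \quad \tau = YX \mapsto Z \qquad (23)$$

Push (inner):

$$\frac{}{p_{\mathcal{A}}(\tau) : \mathit{inner}(X, Y, j, j)} \quad \tau = X \mapsto XY \qquad (24)$$

Scan (inner):

$$\frac{p_2 : \mathit{inner}(Z, X, i, j)}{p_2 \cdot p_{\mathcal{A}}(\tau) : \mathit{inner}(Z, Y, i, j')} \quad \left\{ \begin{array}{l} \tau = X \stackrel{x,y}{\mapsto} Y \\ (x = \epsilon \wedge j' = j) \;\vee \\ (x = a_{j+1} \wedge j' = j + 1) \end{array} \right. \qquad (25)$$

Pop (inner):

$$\frac{\begin{array}{l} p_2 : \mathit{inner}(W, Y, i, j) \\ p_2' : \mathit{inner}(Y, X, j, k) \end{array}}{p_2 \cdot p_2' \cdot p_{\mathcal{A}}(\tau) : \mathit{inner}(W, Z, i, k)} \quad \tau = YX \mapsto Z \qquad (26)$$

Figure 5: Deduction system of table items.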


In other words, we sum the probabilities of all strings $w = w'w''$ that start with
prefix $w'$. We will now show that this probability can also be expressed in terms
of the probabilities of ‘forward’ items.
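
Before turning to table items, the definition itself can be checked by brute force on a small example. The distribution below, $p(a^n b^n) = 0.5^n$ for $n \geq 1$, is an assumption made for this illustration, and the infinite sum is truncated at a hypothetical bound `max_n`, so the result is an approximation from below.

```python
def string_prob(s):
    """Assumed toy distribution: p(a^n b^n) = 0.5**n for n >= 1, and 0 otherwise."""
    n = len(s) // 2
    return 0.5 ** n if n >= 1 and s == "a" * n + "b" * n else 0.0


def prefix_probability(prefix, max_n=60):
    """Sum p(w) over all strings w = a^n b^n (1 <= n <= max_n) that start with prefix."""
    total = 0.0
    for n in range(1, max_n + 1):
        w = "a" * n + "b" * n
        if w.startswith(prefix):
            total += string_prob(w)
    return total
```

For instance, every string in the support starts with `a`, so `prefix_probability("a")` is close to 1, whereas only strings with $n \geq 2$ start with `aa`, which gives a value close to 0.5.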

Assume that $w' = a_1 \cdots a_n$, for some $n \geq 0$. Any computation on a string
$w = w'w''$ that is the prefix of a complete computation must be of one of two
types. The first is $(X_{\mathit{init}}, a_1 \cdots a_n, \epsilon) \stackrel{c}{\vdash^*} (X_{\mathit{final}}, \epsilon, v)$, for some $v$, which means
that $w'' = \epsilon$, so that no input beyond position $n$ needs to be read. The second is
$(X_{\mathit{init}}, a_1 \cdots a_m, \epsilon) \stackrel{c_1}{\vdash^*} (\alpha X, a_{n+1} \cdots a_m, v_1) \stackrel{\tau}{\vdash} (\alpha Y, a_{n+2} \cdots a_m, v_1 y) \stackrel{c_2}{\vdash^*}
(X_{\mathit{final}}, \epsilon, v_1 y v_2)$, where $\tau$ is a scan transition $X \stackrel{a,y}{\mapsto} Y$ such that $a = a_{n+1}$.

The sum of probabilities of computations of the first type equals the sum of
$p_1$ over all derivations of items $p_1 : \mathit{forward}(\bot, X_{\mathit{final}}, 0, n)$, as we have explained
above. For the second type of computation, properness and consistency imply
that for given $c_1$ and $\tau$ as above, the sum of probabilities of different $c_2$ must
be 1. (If that sum, say $q$, is less than 1, then the sum of the probabilities of all
computations cannot be more than $1 - (1 - q) \cdot p_{\mathcal{A}}(c_1 \tau) < 1$, which is in conflict
with the assumed consistency.) Furthermore, properness implies that the sum
of probabilities of different $\tau$ that we can apply for top-of-stack $X$ must be 1.
Therefore, we may conclude that the sum of probabilities of computations of the
second type equals the sum of $p_{\mathcal{A}}(c_1)$ over all computations $(X_{\mathit{init}}, a_1 \cdots a_n, \epsilon) \stackrel{c_1}{\vdash^*}
(\alpha X, \epsilon, v_1)$ such that there is at least one scan transition of the form $X \stackrel{a,y}{\mapsto} Y$.
This equals the sum of $p_1$ over all derivations of items $p_1 : \mathit{forward}(Z, X, i, n)$,
for some $Z$ and $i$, such that there is at least one scan transition of the form $X \stackrel{a,y}{\mapsto} Y$.

Hereby we have shown how both the probability and the prefix probability
of a string can be expressed in terms of derivations of table items. However,
the number of derivations of table items can be infinite. The obvious remedy
lies in an alternative interpretation of the inference rules in Figure 5, following
[]T|: we regard objects of the form $\mathit{forward}(X, Y, i, j)$ or $\mathit{inner}(X, Y, i, j)$ as table
items in their own right, and store each at most once in the parse table. The
associated probabilities are then no longer those for individual derivations, but
are the sums of probabilities over all derivations of table items $\mathit{forward}(X, Y, i, j)$
or $\mathit{inner}(X, Y, i, j)$. Such a sum of probabilities over all derivations of a table
item is commonly called a forward or inner probability, respectively.

We will make this more concrete, under the assumption that there are no
cyclic dependencies, i.e., there is no item $\mathit{forward}(X, Y, i, j)$ or $\mathit{inner}(X, Y, i, j)$
that may occur as ancestor of itself in some derivation. Let $\mathcal{T}$ be the set of all items
$\mathit{forward}(X, Y, i, j)$ or $\mathit{inner}(X, Y, i, j)$ that can be derived using the deduction
system in Figure 5, ignoring the probabilities. We then define a function $p_{\mathit{tab}}$
from table items to probabilities, as shown in Figure 6. We assume the function
$\delta$ evaluates to 1 if its argument is true, and to 0 otherwise.

Each line in the right-hand sides of the two equations in Figure 6 can be
seen as the backward application of an inference rule from Figure 5. In other
words, for a given item, we investigate all possible ways of deriving that item
as the consequent of different inference rules with different antecedents. For
example, the second line in the right-hand side of equation (27) can be seen as
the backward application of inference rule (21).

That Figure 6 is indeed equivalent to Figure 5 follows from the fact that
multiplication distributes over addition. If there are cyclic dependencies, then
the set of equations in Figure 6 may no longer have a closed-form solution, but
we may obtain probabilities by an iterative algorithm that approximates the
lowest non-negative solution to the equations [44].
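
One simple way to approximate a lowest non-negative solution is to start from zero and apply the equations repeatedly. The sketch below does this for a single toy equation, $x = q + (1 - q) x^2$ with $q = 0.4$; both the equation and the plain fixed-point iteration are assumptions chosen for illustration and are not claimed to be the algorithm of [44].

```python
def least_fixed_point(f, iterations=200):
    """Approximate the lowest non-negative solution of x = f(x).

    Starting from 0 and applying a monotone f repeatedly approaches the
    least fixed point from below.
    """
    x = 0.0
    for _ in range(iterations):
        x = f(x)
    return x


# Toy equation (assumed): x = q + (1 - q) * x**2 with q = 0.4.
# Its two solutions are 2/3 and 1; the iteration approaches the smaller one.
q = 0.4
solution = least_fixed_point(lambda x: q + (1 - q) * x * x)
```

Starting from zero matters here: the same equation also has the solution 1, and an iteration started above 1 would not approach the lowest solution.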

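$$
\begin{array}{l}
p_{\mathit{tab}}(\mathit{forward}(X, Y, i, j)) \; = \\[4pt]
\quad \delta(X = \bot \wedge Y = X_{\mathit{init}} \wedge i = j = 0) \; + \\[4pt]
\quad \delta(i = j) \cdot \displaystyle\sum_{\substack{Z, k, \tau: \\ \mathit{forward}(Z, X, k, i) \in \mathcal{T}, \\ \tau = X \mapsto XY}} p_{\mathit{tab}}(\mathit{forward}(Z, X, k, i)) \cdot p_{\mathcal{A}}(\tau) \; + \\[4pt]
\quad \displaystyle\sum_{\substack{Z, j', x, y, \tau: \\ \mathit{forward}(X, Z, i, j') \in \mathcal{T}, \\ (x = \epsilon \wedge j' = j) \vee (x = a_j \wedge j' = j - 1), \\ \tau = Z \stackrel{x,y}{\mapsto} Y}} p_{\mathit{tab}}(\mathit{forward}(X, Z, i, j')) \cdot p_{\mathcal{A}}(\tau) \; + \\[4pt]
\quad \displaystyle\sum_{\substack{W, Z, k, \tau: \\ \mathit{forward}(X, W, i, k) \in \mathcal{T}, \; \mathit{inner}(W, Z, k, j) \in \mathcal{T}, \\ \tau = WZ \mapsto Y}} p_{\mathit{tab}}(\mathit{forward}(X, W, i, k)) \cdot p_{\mathit{tab}}(\mathit{inner}(W, Z, k, j)) \cdot p_{\mathcal{A}}(\tau)
\end{array}
\qquad (27)
$$

$$
\begin{array}{l}
p_{\mathit{tab}}(\mathit{inner}(X, Y, i, j)) \; = \\[4pt]
\quad \delta(i = j) \cdot \displaystyle\sum_{\tau = X \mapsto XY} p_{\mathcal{A}}(\tau) \; + \\[4pt]
\quad \displaystyle\sum_{\substack{Z, j', x, y, \tau: \\ \mathit{inner}(X, Z, i, j') \in \mathcal{T}, \\ (x = \epsilon \wedge j' = j) \vee (x = a_j \wedge j' = j - 1), \\ \tau = Z \stackrel{x,y}{\mapsto} Y}} p_{\mathit{tab}}(\mathit{inner}(X, Z, i, j')) \cdot p_{\mathcal{A}}(\tau) \; + \\[4pt]
\quad \displaystyle\sum_{\substack{W, Z, k, \tau: \\ \mathit{inner}(X, W, i, k) \in \mathcal{T}, \; \mathit{inner}(W, Z, k, j) \in \mathcal{T}, \\ \tau = WZ \mapsto Y}} p_{\mathit{tab}}(\mathit{inner}(X, W, i, k)) \cdot p_{\mathit{tab}}(\mathit{inner}(W, Z, k, j)) \cdot p_{\mathcal{A}}(\tau)
\end{array}
\qquad (28)
$$

Figure 6: Recursive functions to determine probabilities of table items.

Given the set of equations in Figure 6, we can now express the probability of
a string of length $n$ as $p_{\mathit{tab}}(\mathit{forward}(\bot, X_{\mathit{final}}, 0, n))$. The prefix probability of a
string of length $n$ is given by:

$$
\begin{array}{lr}
p_{\mathit{tab}}(\mathit{forward}(\bot, X_{\mathit{final}}, 0, n)) \; + & (29) \\[4pt]
\displaystyle\sum_{\substack{X, Y, i: \\ \mathit{forward}(X, Y, i, n) \in \mathcal{T}, \\ \exists \tau, a, y, Z \, [\tau = Y \stackrel{a,y}{\mapsto} Z]}} p_{\mathit{tab}}(\mathit{forward}(X, Y, i, n)) & (30)
\end{array}
$$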
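
The recursive equations of Figure 6 and the prefix-probability sum (29)–(30) can be sketched with memoised recursion. All concrete names below are assumptions for illustration: the toy PPDT (accepting $a^n b^n$, $n \geq 1$, with probability $0.5^n$), the dictionary encodings, and `"_|_"` for the bottom symbol. Two simplifications are made and should be read as assumptions as well: output strings are ignored, and the set $\mathcal{T}$ is not computed explicitly, because for this acyclic toy automaton underivable items simply evaluate to 0. In the toy automaton every stack symbol that has a scan transition over an input symbol has no other transitions.

```python
from functools import lru_cache

# Toy PPDT, assumed for illustration only; output strings are ignored.
PUSH = {("I", "A"): 1.0, ("P", "A"): 1.0}                # X -> X Y
SCAN = {("A", "a", "A1"): 1.0, ("A1", "", "P"): 0.5,     # X -(x)-> Y, "" = epsilon
        ("A1", "", "Q"): 0.5, ("Q", "b", "D"): 1.0, ("C", "b", "D"): 1.0}
POP = {("P", "D", "C"): 1.0, ("I", "D", "F"): 1.0}       # Y X -> Z
X_INIT, X_FINAL, BOTTOM = "I", "F", "_|_"
SYMBOLS = [BOTTOM, "I", "A", "A1", "P", "Q", "C", "D", "F"]


def make_p_tab(w):
    """Return a memoised function p_tab(kind, X, Y, i, j) for input string w."""

    @lru_cache(maxsize=None)
    def p_tab(kind, x, y, i, j):
        total = 0.0
        # First line of (27): the initial item.
        if kind == "forward" and (x, y, i, j) == (BOTTOM, X_INIT, 0, 0):
            total += 1.0
        # Backward application of the push rules (21) and (24).
        if i == j and (x, y) in PUSH:
            if kind == "inner":
                total += PUSH[(x, y)]
            else:
                total += PUSH[(x, y)] * sum(p_tab("forward", z, x, k, i)
                                            for z in SYMBOLS for k in range(i + 1))
        # Backward application of the scan rules (22) and (25).
        for (z, sym, y2), q in SCAN.items():
            if y2 != y:
                continue
            if sym == "":
                total += p_tab(kind, x, z, i, j) * q
            elif j > i and w[j - 1] == sym:
                total += p_tab(kind, x, z, i, j - 1) * q
        # Backward application of the pop rules (23) and (26).
        for (w_sym, z, y2), q in POP.items():
            if y2 == y:
                total += q * sum(p_tab(kind, x, w_sym, i, k) * p_tab("inner", w_sym, z, k, j)
                                 for k in range(i, j + 1))
        return total

    return p_tab


def prefix_probability(w):
    """Evaluate (29) + (30) for the prefix w."""
    p_tab, n = make_p_tab(w), len(w)
    scanning = {x for (x, sym, _) in SCAN if sym != ""}   # symbols with a scan over input
    total = p_tab("forward", BOTTOM, X_FINAL, 0, n)
    for x in SYMBOLS:
        for y in scanning:
            for i in range(n + 1):
                total += p_tab("forward", x, y, i, n)
    return total
```

For this toy automaton, `prefix_probability("aa")` evaluates to `0.5`, which agrees with summing $0.5^n$ over all $n \geq 2$.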


To obtain a suitable PPDT from a given PCFG, we may apply the strategy
$S_{\epsilon\text{-LC}}$ introduced earlier. Provided the (P)CFG is acyclic, this strategy ensures
that there are no computations of infinite length for any given input, which
implies there are no cyclic dependencies in the simulation of the automaton by
the dynamic programming algorithm.

Hereby we have presented a way to compute probabilities and prefix probabilities
of strings. Our approach is an alternative to the one from [19, 44], and has
the advantage that the approach is parameterized by the parsing strategy: instead
of $S_{\epsilon\text{-LC}}$ we may apply any other parsing strategy with the same properties
with regard to acyclic grammars. If our grammars are even more constrained, e.g.
if they do not have epsilon rules, we may apply even simpler parsing strategies.
Different parsing strategies may differ in the efficiency of the computation.


10 Conclusions 

We have formalized the notion of parsing strategy as a mapping from context- 
free grammars to push-down transducers, and have investigated the extension 
to probabilities. We have shown that the question of which strategies can be 
extended to become probabilistic heavily relies on two properties, the correct- 
prefix property and the strong predictiveness property. The CPP is a necessary 
condition for extending a strategy to become a probabilistic strategy. The CPP 
and SPP together form a sufficient condition. We have shown that there is at 
least one strategy of practical interest with the CPP but without the SPP that 
cannot be extended to become a probabilistic strategy. Lastly, we have presented 
an application to prefix probabilities. 